Re: SOLVED: injector in nutch-1.4

2011-10-19 Thread Radim Kolar
The error was caused by an incorrect entry in domain-urlfilter: I had 
.cz in there and it should be only cz
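
For anyone hitting the same thing, a minimal sketch of a working 
conf/domain-urlfilter.txt (illustrative entries, not my actual file): the 
urlfilter-domain plugin takes bare host names, domains or suffixes, one per 
line and without a leading dot, e.g.

cz
example.com
www.apache.org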


Re: How does nutch handles javaScript in href

2011-10-19 Thread Marek Bachmann

So, I figured out that they are not discarded.

Let's take this URL for example:

http://www.uni-kassel.de/intranet/footernavi/nbjmup+qptutufmmfAvoj.lbttfm/ef

This page is not found. I used the linkdb to determine why this dead link 
is in the crawldb. The result:


./nutch readlinkdb linkdb -url 
"http://www.uni-kassel.de/intranet/footernavi/nbjmup+qptutufmmfAvoj.lbttfm/ef"
11/10/19 01:29:52 INFO util.NativeCodeLoader: Loaded the native-hadoop 
library
11/10/19 01:29:52 INFO zlib.ZlibFactory: Successfully loaded  
initialized native-zlib library

11/10/19 01:29:52 INFO compress.CodecPool: Got brand-new decompressor
11/10/19 01:29:52 INFO compress.CodecPool: Got brand-new decompressor
11/10/19 01:29:52 INFO compress.CodecPool: Got brand-new decompressor
11/10/19 01:29:52 INFO compress.CodecPool: Got brand-new decompressor
11/10/19 01:29:52 INFO compress.CodecPool: Got brand-new decompressor
11/10/19 01:29:52 INFO compress.CodecPool: Got brand-new decompressor
11/10/19 01:29:52 INFO compress.CodecPool: Got brand-new decompressor
11/10/19 01:29:52 INFO compress.CodecPool: Got brand-new decompressor
11/10/19 01:29:52 INFO compress.CodecPool: Got brand-new decompressor
11/10/19 01:29:52 INFO compress.CodecPool: Got brand-new decompressor
fromUrl: http://www.uni-kassel.de/intranet/footernavi/redaktion.html anchor:
fromUrl: http://www.uni-kassel.de/intranet/footernavi/bildnachweis.html 
anchor:

fromUrl: http://www.uni-kassel.de/intranet/footernavi/sitemap.html anchor:

I took the first page 
http://www.uni-kassel.de/intranet/footernavi/redaktion.html and ran 
ParserChecker on it. This is the result:


./nutch org.apache.nutch.parse.ParserChecker 
"http://www.uni-kassel.de/intranet/footernavi/redaktion.html"
11/10/19 13:58:02 INFO parse.ParserChecker: fetching: 
http://www.uni-kassel.de/intranet/footernavi/redaktion.html
11/10/19 13:58:02 WARN plugin.PluginRepository: Plugins: directory not 
found: ${job.local.dir}/../jars/plugins
11/10/19 13:58:02 INFO plugin.PluginRepository: Plugins: looking in: 
/tmp/hadoop-nutch/hadoop-unjar8228180125857982003/plugins

(...)
11/10/19 13:58:02 INFO http.Http: http.proxy.host = null
11/10/19 13:58:02 INFO http.Http: http.proxy.port = 8080
11/10/19 13:58:02 INFO http.Http: http.timeout = 1
11/10/19 13:58:02 INFO http.Http: http.content.limit = 10485760
11/10/19 13:58:02 INFO http.Http: http.agent = Uni Kassel 
Spider/Nutch-1.3 (Test Crawler des ITS der Uni Kassel)
11/10/19 13:58:02 INFO http.Http: http.accept.language = 
en-us,en-gb,en;q=0.7,*;q=0.3
11/10/19 13:58:05 INFO conf.Configuration: found resource 
tika-mimetypes.xml at 
file:/tmp/hadoop-nutch/hadoop-unjar8228180125857982003/tika-mimetypes.xml
11/10/19 13:58:05 INFO parse.ParserChecker: parsing: 
http://www.uni-kassel.de/intranet/footernavi/redaktion.html
11/10/19 13:58:05 INFO parse.ParserChecker: contentType: 
application/xhtml+xml
11/10/19 13:58:05 INFO conf.Configuration: found resource 
parse-plugins.xml at 
file:/tmp/hadoop-nutch/hadoop-unjar8228180125857982003/parse-plugins.xml
11/10/19 13:58:05 WARN parse.ParserFactory: ParserFactory:Plugin: 
org.apache.nutch.parse.html.HtmlParser mapped to contentType 
application/xhtml+xml via parse-plugins.xml, but its plugin.xml file 
does not claim to support contentType: application/xhtml+xml

-
Url
---
http://www.uni-kassel.de/intranet/footernavi/redaktion.html-
ParseData
-
Version: 5
Status: success(1,0)
Title: Intranet: Redaktion
Outlinks: 23
  outlink: toUrl: 
http://www.uni-kassel.de/intranet/footernavi/typo3/ext/uk_solr_search//autocompletion/completer.php 
anchor:
  outlink: toUrl: 
http://www.uni-kassel.de/intranet/footernavi/nbjmup+jousbofuAvoj.lbttfm/ef 
anchor:
  outlink: toUrl: 
http://www.uni-kassel.de/intranet/footernavi/nbjmup+qptutufmmfAvoj.lbttfm/ef 
anchor:
  outlink: toUrl: 
http://www.uni-kassel.de/intranet/footernavi/redaktion.html#nav anchor: 
Skip to navigation (Press Enter).
  outlink: toUrl: 
http://www.uni-kassel.de/intranet/footernavi/redaktion.html#col3 anchor: 
Skip to main content (Press Enter).
  outlink: toUrl: 
http://www.uni-kassel.de/intranet/metanavi/zur-uni-startseite.html 
anchor: zur Uni-Startseite
  outlink: toUrl: 
http://www.uni-kassel.de/intranet/aktuelles/aktuelles-aus.html anchor: 
Intranet
  outlink: toUrl: 
http://www.uni-kassel.de/intranet/footernavi/redaktion.html anchor: 
Redaktion
  outlink: toUrl: http://www.uni-kassel.de/ anchor: Logo der 
Universität Kassel
  outlink: toUrl: 
http://www.uni-kassel.de/intranet/aktuelles/aktuelles-aus.html anchor: 
Aktuelles
  outlink: toUrl: 
http://www.uni-kassel.de/intranet/themen/ueberblick.html anchor: Themen
  outlink: toUrl: 
http://www.uni-kassel.de/intranet/abteilungen/ueberblick.html anchor: 
Abteilungen
  outlink: toUrl: 
http://www.uni-kassel.de/intranet/organisation/ueberblick.html anchor: 
Organisation
  outlink: toUrl: 
http://www.uni-kassel.de/intranet/schnelleinstieg/ueberblick.html 
anchor: Schnelleinstieg
  outlink: toUrl: 

build nutch-1.3 from src/plugin

2011-10-19 Thread Ashish Mehrotra
After trying to build nutch-1.3 from source unsuccessfully on a Mac, I tried it 
from a Linux x86 machine. Running the ant build from the top level works fine and 
creates the classes and runtime folders. After that, when I go to src/plugin and 
try to run ant from there, I see issues like --
Problem: failed to create task or type antlib:org.apache.ivy.ant:settings
This appears to be an antlib declaration. Action: Check that the implementing 
library exists in one of:        -/home/ashish/utils/apache-ant-1.8.2/lib       
 -/home/ashish/.ant/lib
If I copy ivy.jar into that location, I start getting issues like --ivy:resolve 
doesn't support the log attribute.
Question - should Nutch always be built from NUTCH_HOME, and should plugins 
not be built separately from the src/plugin folder?


Re: How does nutch handles javaScript in href

2011-10-19 Thread lewis john mcgibbney
Hi Marek,

This is v. interesting and I am looking forward to hearing from anyone with
similar problems. Unfortunately I've not experienced this behaviour, however
it is clearly a significant problem as you point out. Ultimately it should
be ironed out.

What a great tool the ParserChecker is.

11/10/19 13:58:05 INFO parse.ParserChecker: parsing:
 http://www.uni-kassel.de/intranet/footernavi/redaktion.html
 11/10/19 13:58:05 INFO parse.ParserChecker: contentType:
 application/xhtml+xml
 11/10/19 13:58:05 INFO conf.Configuration: found resource parse-plugins.xml
 at file:/tmp/hadoop-nutch/hadoop-unjar8228180125857982003/
 parse-plugins.xml
 11/10/19 13:58:05 WARN parse.ParserFactory: ParserFactory:Plugin:
 org.apache.nutch.parse.html.HtmlParser mapped to contentType
 application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does
 not claim to support contentType: application/xhtml+xml


This indicates that parse-html was not used and that the wildcard
contentType defaults to parse-tika... am I correct here?

If this is the case then it means that parse-tika is not dealing with the
problem as you describe it. However I must also comment, that we recently
committed Ferdy's NUTCH-1097 for trunk-1.4 which meant that parse-html dealt
with application/xhtml+xml material. It would be interesting to see if
parse-html in trunk-1.4 deals with this now. If not then I think this needs
to be filed as a JIRA issue and dealt with appropriately.

Can you please check and get back to us...

Thanks

Lewis


Re: build nutch-1.3 from src/plugin

2011-10-19 Thread lewis john mcgibbney
Hi,

In my experience I have never 'needed' to build from anywhere other than
NUTCH_HOME. However I would imagine that this is not always the case in some
production environments.

I think the method you describe for building plugins works slightly against
the way we currently do this, which is:

independent plugin management from NUTCH_HOME/src/plugin vs. centralised
plugin management via build.xml from NUTCH_HOME

I would suggest one possible workaround... and I apologise if this is
slightly off topic. You can comment out the 'build & deploy', 'test' and
'clean' targets for the plugins you do not wish to build within
NUTCH_HOME/src/plugin/build.xml. This will enable you to build only the
plugins you desire from NUTCH_HOME/build.xml.
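
To make that concrete, a rough and purely illustrative sketch of the idea (the
real NUTCH_HOME/src/plugin/build.xml lists many more plugins and also has
matching 'test' and 'clean' targets; the plugin names below are only examples):

<target name="deploy">
  <ant dir="parse-html" target="deploy"/>
  <ant dir="parse-tika" target="deploy"/>
  <!-- plugins you do not want built, commented out: -->
  <!-- <ant dir="parse-js" target="deploy"/> -->
  <!-- <ant dir="parse-swf" target="deploy"/> -->
</target>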

As I said, sorry if my comments are off topic in any way.

Lewis

On Wed, Oct 19, 2011 at 1:27 PM, Ashish Mehrotra ashme...@yahoo.com wrote:

 After trying to build nutch-1.3 from source unsuccessfully from Mac, I
 tried it from a Linux X86 machine. Making ant build from top level works
 fine. Makes the classes and runtime folders.After that when I go to
 src/plugin and try to fire ant from there, I see issues like --
 Problem: failed to create task or type antlib:org.apache.ivy.ant:settings
 This appears to be an antlib declaration. Action: Check that the
 implementing library exists in one of:
 -/home/ashish/utils/apache-ant-1.8.2/lib-/home/ashish/.ant/lib
 If I copy ivy.jar into the location, I start getting issue like
 --ivy:resolve doesn't support the log attribute
 Question - Should nutch always be built from the NUTCH_HOME and plugins
 should not be tried to be built separately from the src/plugin folder ?




-- 
Lewis


a plugin to select the re-crawl date of a page

2011-10-19 Thread mathieu lacage
hi,

I am looking into nutch to try to crawl a couple of forum-based websites and
I would like to avoid writing scripts to generate lists of urls to perform
daily incremental crawls. Instead, I suspect that I should be able to write
a plugin for nutch which is able to associate with each url the date of the
next crawl so that nutch generate does the right thing and picks the urls
which need to be refreshed, hence picking new messages in live/recent
discussions as well as whole new discussions.

I have started to dive into the code to figure out how I might be able to
pull this off, but I suspect that someone more knowledgeable about the
structure of nutch itself could give me hints as to where to look, hence
saving me quite a bit of time.

Mathieu

-- 
Mathieu Lacage mathieu.lac...@alcmeon.com


Re: How does nutch handles javaScript in href

2011-10-19 Thread Marek Bachmann

On 19.10.2011 14:34, lewis john mcgibbney wrote:

Hi Marek,

This is v. interesting and I am looking forward to hearing from anyone with
similar problems. Unfortunately I've not experienced this behaviour, however
it is clearly a significant problem as you point out. Ultimately it should
be ironed out.

What a great tool the ParserChecker is.

11/10/19 13:58:05 INFO parse.ParserChecker: parsing:

http://www.uni-kassel.de/intranet/footernavi/redaktion.html
11/10/19 13:58:05 INFO parse.ParserChecker: contentType:
application/xhtml+xml
11/10/19 13:58:05 INFO conf.Configuration: found resource parse-plugins.xml
at file:/tmp/hadoop-nutch/hadoop-unjar8228180125857982003/
parse-plugins.xml
11/10/19 13:58:05 WARN parse.ParserFactory: ParserFactory:Plugin:
org.apache.nutch.parse.html.HtmlParser mapped to contentType
application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does
not claim to support contentType: application/xhtml+xml



This indicates that parse-html was not used and the default for wildcard
contentType defaults to parse-tika... am I correct here?


According to my parse-plugins.xml, yes:

  <!-- by default if the mimeType is set to *, or
       if it can't be determined, use parse-tika -->
  <mimeType name="*">
      <plugin id="parse-tika" />
  </mimeType>

BUT:

I added LOG.info("This is HtmlParser"); to the first line in getParse in 
HtmlParser.java and compiled it. After that I got:


(...)
11/10/19 15:20:08 WARN parse.ParserFactory: ParserFactory:Plugin: 
org.apache.nutch.parse.html.HtmlParser mapped to contentType 
application/xhtml+xml via parse-plugins.xml, but its plugin.xml file 
does not claim to support contentType: application/xhtml+xml


11/10/19 15:20:08 INFO parse.html: This is HtmlParser

-
Url
---
http://www.uni-kassel.de/intranet/footernavi/redaktion.html-
ParseData
-
Version: 5
Status: success(1,0)
Title: Intranet: Redaktion
Outlinks: 23
  outlink: toUrl: 
http://www.uni-kassel.de/intranet/footernavi/typo3/ext/uk_solr_search//autocompletion/completer.php 
anchor:
  outlink: toUrl: 
http://www.uni-kassel.de/intranet/footernavi/nbjmup+jousbofuAvoj.lbttfm/ef 
anchor:

(...)

As I understand this, the HtmlParser IS used and NOT Tika?




If this is the case then it means that parse-tika is not dealing with the
problem as you describe it. However I must also comment, that we recently
committed Ferdy's NUTCH-1097 for trunk-1.4 which meant that parse-html dealt
with application/xhtml+xml material. It would be interesting to see if
parse-html in trunk-1.4 deals with this now. If not then I think this needs
to be filed as a JIRA issue and dealt with appropriately.

Can you please check and get back to us...

Thanks

Lewis





Re: Re: How does nutch handles javaScript in href

2011-10-19 Thread lewis . mcgibbney
Then in my own opinion there is no existing code within parse-html which  
prevents it from parsing the anchor snippets you've posted.


This would make a great addition to parse-html as it seems to be an  
unforeseen boundary case that we should not ignore.


If you don't get feedback on this, can I ask for you to open a JIRA ticket  
based upon your understanding of the situation?


Thank you

On , Marek Bachmann m.bachm...@uni-kassel.de wrote:

 On 19.10.2011 14:34, lewis john mcgibbney wrote:
 (...)
 This indicates that parse-html was not used and the default for wildcard
 contentType defaults to parse-tika... am I correct here?

 According to my parse-plugins.xml, yes:
 (...)
 BUT:

 I added LOG.info("This is HtmlParser"); to the first line in getParse in
 HtmlParser.java and compiled it. After that I got:

 (...)
 11/10/19 15:20:08 INFO parse.html: This is HtmlParser
 (...)

 As I understand this, the HtmlParser IS used and NOT Tika?
 (...)


not able to parse adobe 9.0 pdfs using 1.3 tika parser

2011-10-19 Thread digho
These PDFs were not getting parsed with the parse-pdf plugin of nutch 1.2.
So I tried 1.3 and saw that even simple, old PDFs are also not working.

my code (TestParse.java):

bash-2.00$ cat TestParse.java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.util.Iterator;
import java.util.Map;
import java.util.Map.Entry;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseStatus;
import org.apache.nutch.parse.ParseUtil;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

public class TestParse {

    private static Configuration conf = NutchConfiguration.create();

    public TestParse() {
    }

    public static void main(String[] args) {
        String filename = args[0];
        convert(filename);
    }

    public static String convert(String fileName) {
        String newName = "abc.html";

        try {
            System.out.println("Converting " + fileName + " to html.");
            if (convertToHtml(fileName, newName))
                return newName;
        } catch (Exception e) {
            (new File(newName)).delete();
            System.out.println("General exception " + e.getMessage());
        }
        return null;
    }

    private static boolean convertToHtml(String fileName, String newName)
            throws Exception {
        // Read the file
        FileInputStream in = new FileInputStream(fileName);
        byte[] buf = new byte[in.available()];
        in.read(buf);
        in.close();

        // Parse the file (the empty content type lets Nutch detect it)
        Content content = new Content("file:" + fileName, "file:" + fileName,
                buf, "", new Metadata(), conf);
        ParseResult parseResult = new ParseUtil(conf).parse(content);
        parseResult.filter();
        if (parseResult.isEmpty()) {
            System.out.println("All parsing attempts failed");
            return false;
        }
        Iterator<Map.Entry<Text, Parse>> iterator = parseResult.iterator();
        if (iterator == null) {
            System.out.println("Cannot iterate over successful parse results");
            return false;
        }
        Parse parse = null;
        ParseData parseData = null;
        while (iterator.hasNext()) {
            parse = parseResult.get((Text) iterator.next().getKey());
            parseData = parse.getData();
            ParseStatus status = parseData.getStatus();

            // If Parse failed then bail
            if (!ParseStatus.STATUS_SUCCESS.equals(status)) {
                System.out.println("Could not parse " + fileName + ". " +
                        status.getMessage());
                return false;
            }
        }

        // Start writing to newName
        FileOutputStream fout = new FileOutputStream(newName);
        PrintStream out = new PrintStream(fout, true, "UTF-8");

        // Start Document
        out.println("<html>");

        // Start Header
        out.println("<head>");

        // Write Title
        String title = parseData.getTitle();
        if (title != null && title.trim().length() > 0) {
            out.println("<title>" + parseData.getTitle() + "</title>");
        }

        // Write out Meta tags
        Metadata metaData = parseData.getContentMeta();
        String[] names = metaData.names();
        for (String name : names) {
            String[] subvalues = metaData.getValues(name);
            String values = "";   // start empty so the output does not begin with "null"
            for (String subvalue : subvalues) {
                values += subvalue;
            }
            if (values.length() > 0)
                out.printf("<meta name=\"%s\" content=\"%s\"/>\n",
                        name, values);
        }
        out.println("<meta http-equiv=\"Content-Type\" " +
                "content=\"text/html;charset=UTF-8\"/>");
        // End Meta tags

        out.println("</head>"); // End Header

        // Start Body
        out.println("<body>");
        out.print(parse.getText());
        out.println("</body>"); // End Body

        out.println("</html>"); // End Document

        out.close(); // Close the file

        return true;
    }
}


command:
==
bash-2.00$ java -classpath
conf:runtime/local/lib/nutch-1.3.jar:runtime/local/lib/hadoop-core-0.20.2.jar:runtime/local/lib/commons-logging-api-1.0.4.jar:runtime/local/lib/tika-core-0.9.jar:runtime/local/lib/log4j-1.2.15.jar:runtime/local/lib/oro-2.0.8.jar:.
TestParse direct.pdf
==

output:
_
Converting direct.pdf to html.
Oct 19, 2011 5:05:19 PM org.apache.hadoop.conf.Configuration
getConfResourceAsInputStream
INFO: found resource tika-mimetypes.xml at
file:/path/to/nutch/1.3/conf/tika-mimetypes.xml
Oct 19, 2011 5:05:20 PM org.apache.nutch.plugin.PluginManifestParser
parsePluginFolder
INFO: Plugins: looking in: 

Re: a plugin to select the re-crawl date of a page

2011-10-19 Thread Markus Jelsma
Hi,

If you use a low default fetch interval for newly discovered pages they will 
be fetched very frequently. If you combine that with an adaptive fetch scheduler 
and text profiling, those pages that do not change anymore will start to 
be fetched less and less, giving room for new pages.

Check the Nutch configuration for settings and descriptions. 
AdaptiveFetchSchedule and TextProfile are keywords.
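
For illustration, switching these on in nutch-site.xml looks roughly like this 
(the property names come from nutch-default.xml; the interval value is only an 
example, pick your own):

<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
</property>
<property>
  <name>db.fetch.interval.default</name>
  <!-- example only: start newly discovered pages at a one-day interval -->
  <value>86400</value>
</property>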

Cheers

 hi,
 
 I am looking into nutch to try to crawl a couple of forum-based websites
 and I would like to avoid writing scripts to generate lists of urls to
 perform daily incremental crawls. Instead, I suspect that I should be able
 to write a plugin for nutch which is able to associate with each url the
 date of the next crawl so that nutch generate does the right thing and
 picks the urls which need to be refreshed, hence picking new messages in
 live/recent discussions as well as whole new discussions.
 
 I have started to dive into the code to figure out how I might be able to
 do pull this off but I suspect that someone more knowledgeable with the
 structure of nutch itself could give me hints as to where to look, hence
 saving me quite a bit of time.
 
 Mathieu


Re: not able to parse adobe 9.0 pdfs using 1.3 tika parser

2011-10-19 Thread Markus Jelsma
There's always trouble with PDF parsing. Try trunk, it has an upgraded Tika 
including PDF parse improvements. Ultimately, problems with parsing should be 
addressed on the Tika ML or even the PDFBox list.

 These pdfs were not getting parsed with parse-pdf plugin of nutch 1.2.
 So, tried with 1.3. Saw that even simple and old pdfs also not working.
 
 my code (TestParse.java):
 
 (...)

Re: How does nutch handles javaScript in href

2011-10-19 Thread Marek Bachmann

On 19.10.2011 16:00, lewis.mcgibb...@gmail.com wrote:

Then in my own opinion there is no existing code within parse-html which
prevents it from parsing the anchor snippts you've posted.


But something is happening with the content of the href attribute, since 
in the source file its value is:


<a 
href="javascript:linkTo_UnCryptMailto('nbjmup+jousbofuAvoj.lbttfm/ef');" 
class="mail">


and after the parse it is just nbjmup+jousbofuAvoj.lbttfm/ef. That 
means the href value is handled somehow?!


I guess if nothing were done with the href value, then the outlink 
value should be:


http://www.uni-kassel.de/intranet/footernavi/javascript:linkTo_UnCryptMailto('nbjmup+jousbofuAvoj.lbttfm/ef');

Perhaps the JavaScript gets evaluated somewhere but it fails because 
the reference isn't found...


I'll look in the html parser to find more details.
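
For context, TYPO3's linkTo_UnCryptMailto() normally just shifts each character 
code by a fixed offset within a few ranges and then navigates to the result. A 
small stand-alone sketch of that decoding (this is not Nutch code; the offset of 
-1 and the character ranges are the common TYPO3 defaults and are an assumption 
here) shows what the argument really is:

public class UnCryptMailtoSketch {
    // Shift c by offset within [start, end], wrapping at the range boundaries.
    static char shift(int c, int start, int end, int offset) {
        int n = c + offset;
        if (offset > 0 && n > end) n = start + (n - end - 1);
        if (offset < 0 && n < start) n = end - (start - n - 1);
        return (char) n;
    }

    // Apply the shift within '+'..':', '@'..'Z' and 'a'..'z'; leave other chars alone.
    static String decrypt(String enc, int offset) {
        StringBuilder dec = new StringBuilder();
        for (char c : enc.toCharArray()) {
            if (c >= '+' && c <= ':') dec.append(shift(c, '+', ':', offset));
            else if (c >= '@' && c <= 'Z') dec.append(shift(c, '@', 'Z', offset));
            else if (c >= 'a' && c <= 'z') dec.append(shift(c, 'a', 'z', offset));
            else dec.append(c);
        }
        return dec.toString();
    }

    public static void main(String[] args) {
        // Prints a plain mailto: address, i.e. the obfuscated e-mail link.
        System.out.println(decrypt("nbjmup+jousbofuAvoj.lbttfm/ef", -1));
    }
}

So it looks like something is picking out the still-encrypted argument of that 
function and treating it as a relative URL.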



This would make a great addition to the parse-html as it seems to be an
unforseen boundary case that we should not ignore.

If you don't get feedback on this, can I ask for you to open a JIRA
ticket based upon your understanding of the situation?

Thank you





Good workaround for timeout?

2011-10-19 Thread Chip Calhoun
I'm getting a fairly persistent timeout on a particular page. Other, smaller 
pages in this folder do fine, but this one times out most of the time. When it 
fails, my ParserChecker results look like:

# bin/nutch org.apache.nutch.parse.ParserChecker -dumpText 
http://digital.lib.washington.edu/findingaids/view?docId=UA37_06_2932DonaldsonLauren.xml
Exception in thread main java.lang.NullPointerException
at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)

I've stuck with the default value of 10 in my nutch-default.xml's 
fetcher.threads.fetch value, and I've added the following to nutch-site.xml:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>
<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the file://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the http.content.limit setting.
  </description>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>
<property>
  <name>ftp.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  Caution: classical ftp RFCs never defines partial transfer and, in fact,
  some ftp servers out there do not handle client side forced close-down very
  well. Our implementation tries its best to handle such situations smoothly.
  </description>
</property>
<property>
  <name>http.timeout</name>
  <value>999</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

What else can I do? Thanks.

Chip


Fetcher NPE's

2011-10-19 Thread Markus Jelsma
Hi,

We sometimes see a fetcher task failing with 0 pages. Inspecting the logs it's 
clear URLs are actually fetched until, for some reason, an NPE occurs. The 
thread then dies and seems to output 0 records.

The URL's themselves are fetchable using index- or parser checker, no problem 
there. Any ideas how we can pinpoint the source of the issue? 

Thanks,

A sample exception:

2011-10-19 14:30:50,145 INFO org.apache.nutch.fetcher.Fetcher: fetch of 
http://SOME_URL/ failed with: java.lang.NullPointerException
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: 
java.lang.NullPointerException
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at 
java.lang.System.arraycopy(Native Method)
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1276)
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1193)
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at 
java.io.DataOutputStream.writeByte(DataOutputStream.java:136)
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at 
org.apache.hadoop.io.WritableUtils.writeVLong(WritableUtils.java:264)
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at 
org.apache.hadoop.io.WritableUtils.writeVInt(WritableUtils.java:244)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at 
org.apache.hadoop.io.Text.write(Text.java:281)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at 
org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at 
org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1060)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at 
org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:591)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:936)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:805)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: fetcher 
caught:java.lang.NullPointerException

The code catching the error:

801 } catch (Throwable t) { // unexpected exception
802 // unblock
803 fetchQueues.finishFetchItem(fit);
804 logError(fit.url, t.toString());
805 output(fit.url, fit.datum, null, ProtocolStatus.STATUS_FAILED, 
CrawlDatum.STATUS_FETCH_RETRY);
806 } 



Re: How does nutch handles javaScript in href

2011-10-19 Thread Marek Bachmann

One interesting thing I found out:

The HtmlParser class tells me in debug mode (I had to replace the 
LOG.trace statements with LOG.debug, since I don't know how to use the 
trace level) that it had found 20 outlinks:


2011-10-19 16:59:38,061 DEBUG parse.html - found 20 outlinks in 
http://www.uni-kassel.de/intranet/footernavi/redaktion.html


BUT the result of ParserChecker tells me there were 23 outlinks:

(...)
Status: success(1,0)
Title: Intranet: Redaktion
Outlinks: 23
  outlink: toUrl: 
http://www.uni-kassel.de/intranet/footernavi/typo3/ext/uk_solr_search//autocompletion/completer.php 
anchor:
  outlink: toUrl: 
http://www.uni-kassel.de/intranet/footernavi/nbjmup+jousbofuAvoj.lbttfm/ef 
anchor:
  outlink: toUrl: 
http://www.uni-kassel.de/intranet/footernavi/nbjmup+qptutufmmfAvoj.lbttfm/ef 
anchor:

(...)

These first three links are the ones which shouldn't be there, and their 
count is the difference between the output of ParserChecker and the 
debug log.


Seems these links don't get into the list through HtmlParser?

On 19.10.2011 16:24, Marek Bachmann wrote:

On 19.10.2011 16:00, lewis.mcgibb...@gmail.com wrote:

Then in my own opinion there is no existing code within parse-html which
prevents it from parsing the anchor snippts you've posted.


But something is happening with the content of the href attribute, since
in the source file its value is:

a
href=javascript:linkTo_UnCryptMailto('nbjmup+jousbofuAvoj.lbttfm/ef');
class=mail

and after the parse it is just nbjmup+jousbofuAvoj.lbttfm/ef that
means, that the href value is handled somehow?!

I guess if nothing would be done with the href value then the outlink
value should be:

http://www.uni-kassel.de/intranet/footernavi/javascript:linkTo_UnCryptMailto('nbjmup+jousbofuAvoj.lbttfm/ef');


Perhaps the java script gets evaluated somewhere but it fails because
the reference isn't found...

I'll look in the html parser to found more details.



This would make a great addition to the parse-html as it seems to be an
unforseen boundary case that we should not ignore.

If you don't get feedback on this, can I ask for you to open a JIRA
ticket based upon your understanding of the situation?

Thank you







Re: Good workaround for timeout?

2011-10-19 Thread Markus Jelsma
What is timing out, the fetch or the parse?

 I'm getting a fairly persistent  timeout on a particular page. Other,
 smaller pages in this folder do fine, but this one times out most of the
 time. When it fails, my ParserChecker results look like:
 
 # bin/nutch org.apache.nutch.parse.ParserChecker -dumpText
 http://digital.lib.washington.edu/findingaids/view?docId=UA37_06_2932Donal
 dsonLauren.xml Exception in thread main java.lang.NullPointerException
 at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)
 
 I've stuck with the default value of 10 in my nutch-default.xml's
 fetcher.threads.fetch value, and I've added the following to
 nutch-site.xml:
 
 (...)
 
 What else can I do? Thanks.
 
 Chip


Re: How does nutch handles javaScript in href

2011-10-19 Thread Markus Jelsma
Tika can do things a bit differently. At least it did in the past, and it seems 
this is the case here as well; I get 20 outlinks with Tika.

 One interesting thing I found out:
 
 The HtmlParser Class tells me in debug mode (I had to replace the
 LOG.trace states through LOG.debug, since I don't know how to use these
 trace thing) that it had found 20 outlinks:
 
 2011-10-19 16:59:38,061 DEBUG parse.html - found 20 outlinks in
 http://www.uni-kassel.de/intranet/footernavi/redaktion.html
 
 BUT the result of ParserChecker tells me there were 23 outlinks:
 
 (...)
 Status: success(1,0)
 Title: Intranet: Redaktion
 Outlinks: 23
outlink: toUrl:
 http://www.uni-kassel.de/intranet/footernavi/typo3/ext/uk_solr_search//auto
 completion/completer.php anchor:
outlink: toUrl:
 http://www.uni-kassel.de/intranet/footernavi/nbjmup+jousbofuAvoj.lbttfm/ef
 anchor:
outlink: toUrl:
 http://www.uni-kassel.de/intranet/footernavi/nbjmup+qptutufmmfAvoj.lbttfm/e
 f anchor:
 (...)
 
 This first three links are the ones which shouldn't be there. And the
 count is the difference between the output if ParserChecker and the
 debug log.
 
 Seems these links doesn't come to the list through HtmlParser?
 
 On 19.10.2011 16:24, Marek Bachmann wrote:
  On 19.10.2011 16:00, lewis.mcgibb...@gmail.com wrote:
  Then in my own opinion there is no existing code within parse-html which
  prevents it from parsing the anchor snippts you've posted.
  
  But something is happening with the content of the href attribute, since
  in the source file its value is:
  
  a
  href=javascript:linkTo_UnCryptMailto('nbjmup+jousbofuAvoj.lbttfm/ef');
  class=mail
  
  and after the parse it is just nbjmup+jousbofuAvoj.lbttfm/ef that
  means, that the href value is handled somehow?!
  
  I guess if nothing would be done with the href value then the outlink
  value should be:
  
  http://www.uni-kassel.de/intranet/footernavi/javascript:linkTo_UnCryptMai
  lto('nbjmup+jousbofuAvoj.lbttfm/ef');
  
  
  Perhaps the java script gets evaluated somewhere but it fails because
  the reference isn't found...
  
  I'll look in the html parser to found more details.
  
  This would make a great addition to the parse-html as it seems to be an
  unforseen boundary case that we should not ignore.
  
  If you don't get feedback on this, can I ask for you to open a JIRA
  ticket based upon your understanding of the situation?
  
  Thank you


RE: Good workaround for timeout?

2011-10-19 Thread Chip Calhoun
If I'm reading the log correctly, it's the fetch:

2011-10-19 11:18:11,405 INFO  fetcher.Fetcher - fetch of 
http://digital.lib.washington.edu/findingaids/view?docId=UA37_06_2932DonaldsonLauren.xml
 failed with: java.net.SocketTimeoutException: Read timed out


-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Wednesday, October 19, 2011 11:08 AM
To: user@nutch.apache.org
Subject: Re: Good workaround for timeout?

What is timing out, the fetch or the parse?

 I'm getting a fairly persistent  timeout on a particular page. Other, 
 smaller pages in this folder do fine, but this one times out most of 
 the time. When it fails, my ParserChecker results look like:
 
 # bin/nutch org.apache.nutch.parse.ParserChecker -dumpText 
 http://digital.lib.washington.edu/findingaids/view?docId=UA37_06_2932D
 onal dsonLauren.xml Exception in thread main 
 java.lang.NullPointerException
 at 
 org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)
 
 I've stuck with the default value of 10 in my nutch-default.xml's 
 fetcher.threads.fetch value, and I've added the following to
 nutch-site.xml:
 
  (...)
 
 What else can I do? Thanks.
 
 Chip


Re: Fetcher NPE's

2011-10-19 Thread Markus Jelsma
I should add that these URLs not only pass the index- and parser checker but 
also manual local test crawl cycles. There's also nothing significant in 
the syslog. Dmesg shows messages about too little memory but that's normal.

 Hi,
 
 We sometimes see a fetcher task failing with 0 pages. Inspecing the logs
 it's clear URL's are actually fetched until due to some reason a NPE
 occurs. The thread then dies and seems to output 0 records.
 
 The URL's themselves are fetchable using index- or parser checker, no
 problem there. Any ideas how we can pinpoint the source of the issue?
 
 Thanks,
 
 A sample exception:
 
 (...)


Re: Good workaround for timeout?

2011-10-19 Thread Markus Jelsma
It is indeed. Tricky.

Are you going through some proxy? Are you using protocol-http or httpclient? 
Are you sure the http.timeout value is actually used in lib-http?

 If I'm reading the log correctly, it's the fetch:
 
 2011-10-19 11:18:11,405 INFO  fetcher.Fetcher - fetch of
 http://digital.lib.washington.edu/findingaids/view?docId=UA37_06_2932Donal
 dsonLauren.xml failed with: java.net.SocketTimeoutException: Read timed out
 
 
 -Original Message-
 From: Markus Jelsma [mailto:markus.jel...@openindex.io]
 Sent: Wednesday, October 19, 2011 11:08 AM
 To: user@nutch.apache.org
 Subject: Re: Good workaround for timeout?
 
 What is timing out, the fetch or the parse?
 
  I'm getting a fairly persistent  timeout on a particular page. Other,
  smaller pages in this folder do fine, but this one times out most of
  the time. When it fails, my ParserChecker results look like:
  
  # bin/nutch org.apache.nutch.parse.ParserChecker -dumpText
  http://digital.lib.washington.edu/findingaids/view?docId=UA37_06_2932D
  onal dsonLauren.xml Exception in thread main
  java.lang.NullPointerException
  
  at
  
  org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)
  
  I've stuck with the default value of 10 in my nutch-default.xml's
  fetcher.threads.fetch value, and I've added the following to
  nutch-site.xml:
  
  (...)
  
  What else can I do? Thanks.
  
  Chip


Re: FOUND IT - How does nutch handles javaScript in href

2011-10-19 Thread Marek Bachmann

Ok, I went through the source, step by step.

It is the HtmlParserFilter called JSParseFilter. So it seems I have to 
exclude it from the plugin list.


2011-10-19 17:33:46,031 DEBUG js.JSParseFilter -  - outlink from JS: 
'http://www.uni-kassel.de/intranet/footernavi/typo3/ext/uk_solr_search//autocompletion/completer.php'
2011-10-19 17:33:46,041 DEBUG js.JSParseFilter -  - outlink from JS: 
'http://www.uni-kassel.de/intranet/footernavi/nbjmup+jousbofuAvoj.lbttfm/ef'
2011-10-19 17:33:46,042 DEBUG js.JSParseFilter -  - outlink from JS: 
'http://www.uni-kassel.de/intranet/footernavi/nbjmup+qptutufmmfAvoj.lbttfm/ef'


But its behaviour isn't right anyway? It shouldn't take this crypto 
string as an outlink?
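
For reference, excluding it is usually just a matter of making sure the 
plugin.includes regex in nutch-site.xml does not match parse-js. A sketch (the 
value below is close to the stock default and only illustrative, not my actual 
setting):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression of plugin directory names to include;
  note that parse-js is simply not part of the parse-(...) group.</description>
</property>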


On 19.10.2011 17:13, Markus Jelsma wrote:

Tika can do things a bit different. At least it did in the past and it seems
this is the case as well, i get 20 outlinks with Tika.


One interesting thing I found out:

The HtmlParser Class tells me in debug mode (I had to replace the
LOG.trace states through LOG.debug, since I don't know how to use these
trace thing) that it had found 20 outlinks:

2011-10-19 16:59:38,061 DEBUG parse.html - found 20 outlinks in
http://www.uni-kassel.de/intranet/footernavi/redaktion.html

BUT the result of ParserChecker tells me there were 23 outlinks:

(...)
Status: success(1,0)
Title: Intranet: Redaktion
Outlinks: 23
outlink: toUrl:
http://www.uni-kassel.de/intranet/footernavi/typo3/ext/uk_solr_search//auto
completion/completer.php anchor:
outlink: toUrl:
http://www.uni-kassel.de/intranet/footernavi/nbjmup+jousbofuAvoj.lbttfm/ef
anchor:
outlink: toUrl:
http://www.uni-kassel.de/intranet/footernavi/nbjmup+qptutufmmfAvoj.lbttfm/e
f anchor:
(...)

This first three links are the ones which shouldn't be there. And the
count is the difference between the output if ParserChecker and the
debug log.

Seems these links doesn't come to the list through HtmlParser?

On 19.10.2011 16:24, Marek Bachmann wrote:

On 19.10.2011 16:00, lewis.mcgibb...@gmail.com wrote:

Then in my own opinion there is no existing code within parse-html which
prevents it from parsing the anchor snippts you've posted.


But something is happening with the content of the href attribute, since
in the source file its value is:

a
href=javascript:linkTo_UnCryptMailto('nbjmup+jousbofuAvoj.lbttfm/ef');
class=mail

and after the parse it is just nbjmup+jousbofuAvoj.lbttfm/ef that
means, that the href value is handled somehow?!

I guess if nothing would be done with the href value then the outlink
value should be:

http://www.uni-kassel.de/intranet/footernavi/javascript:linkTo_UnCryptMai
lto('nbjmup+jousbofuAvoj.lbttfm/ef');


Perhaps the java script gets evaluated somewhere but it fails because
the reference isn't found...

I'll look in the html parser to found more details.


This would make a great addition to the parse-html as it seems to be an
unforseen boundary case that we should not ignore.

If you don't get feedback on this, can I ask for you to open a JIRA
ticket based upon your understanding of the situation?

Thank you




Re: FOUND IT - How does nutch handles javaScript in href

2011-10-19 Thread Markus Jelsma
Not sure what JsParse is supposed to do in this situation but you should not 
use it anyway. It's not regarded as stable, just like protocol-httpclient.

 Ok, I went though the source, step by step.
 
 It is the HtmlParserFilter called JSParseFilter. So it seems I have to
 exclude it from the plugin list.
 
 2011-10-19 17:33:46,031 DEBUG js.JSParseFilter -  - outlink from JS:
 'http://www.uni-kassel.de/intranet/footernavi/typo3/ext/uk_solr_search//aut
 ocompletion/completer.php' 2011-10-19 17:33:46,041 DEBUG js.JSParseFilter -
  - outlink from JS:
 'http://www.uni-kassel.de/intranet/footernavi/nbjmup+jousbofuAvoj.lbttfm/e
 f' 2011-10-19 17:33:46,042 DEBUG js.JSParseFilter -  - outlink from JS:
 'http://www.uni-kassel.de/intranet/footernavi/nbjmup+qptutufmmfAvoj.lbttfm
 /ef'
 
 But its behaviour isn't right anyway? It shouldn't take this crypto
 string as an outlink?
 
 On 19.10.2011 17:13, Markus Jelsma wrote:
  Tika can do things a bit different. At least it did in the past and it
  seems this is the case as well, i get 20 outlinks with Tika.
  
  One interesting thing I found out:
  
  The HtmlParser Class tells me in debug mode (I had to replace the
  LOG.trace states through LOG.debug, since I don't know how to use these
  trace thing) that it had found 20 outlinks:
  
  2011-10-19 16:59:38,061 DEBUG parse.html - found 20 outlinks in
  http://www.uni-kassel.de/intranet/footernavi/redaktion.html
  
  BUT the result of ParserChecker tells me there were 23 outlinks:
  
  (...)
  Status: success(1,0)
  Title: Intranet: Redaktion
  Outlinks: 23
  
  outlink: toUrl:
  http://www.uni-kassel.de/intranet/footernavi/typo3/ext/uk_solr_search//a
  uto
  
  completion/completer.php anchor:
  outlink: toUrl:
  http://www.uni-kassel.de/intranet/footernavi/nbjmup+jousbofuAvoj.lbttfm/
  ef
  
  anchor:
  outlink: toUrl:
  http://www.uni-kassel.de/intranet/footernavi/nbjmup+qptutufmmfAvoj.lbttf
  m/e f anchor:
  (...)
  
  This first three links are the ones which shouldn't be there. And the
  count is the difference between the output if ParserChecker and the
  debug log.
  
  Seems these links doesn't come to the list through HtmlParser?
  
  On 19.10.2011 16:24, Marek Bachmann wrote:
  On 19.10.2011 16:00, lewis.mcgibb...@gmail.com wrote:
  Then in my own opinion there is no existing code within parse-html
  which prevents it from parsing the anchor snippts you've posted.
  
  But something is happening with the content of the href attribute,
  since in the source file its value is:
  
  a
  href=javascript:linkTo_UnCryptMailto('nbjmup+jousbofuAvoj.lbttfm/ef');
   class=mail
  
  and after the parse it is just nbjmup+jousbofuAvoj.lbttfm/ef that
  means, that the href value is handled somehow?!
  
  I guess if nothing would be done with the href value then the outlink
  value should be:
  
  http://www.uni-kassel.de/intranet/footernavi/javascript:linkTo_UnCryptM
  ai lto('nbjmup+jousbofuAvoj.lbttfm/ef');
  
  
  Perhaps the java script gets evaluated somewhere but it fails because
  the reference isn't found...
  
  I'll look in the html parser to found more details.
  
  This would make a great addition to the parse-html as it seems to be
  an unforseen boundary case that we should not ignore.
  
  If you don't get feedback on this, can I ask for you to open a JIRA
  ticket based upon your understanding of the situation?
  
  Thank you


RE: Good workaround for timeout?

2011-10-19 Thread Chip Calhoun
I'm using protocol-http, but I removed protocol-httpclient after you pointed 
out in another thread that it's broken. Unfortunately I'm not sure which 
properties are used by what, and I'm not sure how to find out. I added some 
more stuff to nutch-site.xml (I'll paste it at the end), and it seems to be 
working so far; but since this has been an intermittent problem, I can't be 
sure whether I've really fixed it or whether I'm getting lucky.

<property>
  <name>http.timeout</name>
  <value>999</value>
  <description>The default network timeout, in milliseconds.</description>
</property>
<property>
  <name>ftp.timeout</name>
  <value>99</value>
  <description>Default timeout for ftp client socket, in millisec.
  Please also see ftp.keep.connection below.</description>
</property>
<property>
  <name>ftp.server.timeout</name>
  <value>9</value>
  <description>An estimation of ftp server idle time, in millisec.
  Typically it is 12 millisec for many ftp servers out there.
  Better be conservative here. Together with ftp.timeout, it is used to
  decide if we need to delete (annihilate) current ftp.client instance and
  force to start another ftp.client instance anew. This is necessary because
  a fetcher thread may not be able to obtain next request from queue in time
  (due to idleness) before our ftp client times out or remote server
  disconnects. Used only when ftp.keep.connection is true (please see below).
  </description>
</property>
<property>
  <name>parser.timeout</name>
  <value>300</value>
  <description>Timeout in seconds for the parsing of a document, otherwise
  treats it as an exception and moves on to the following documents. This
  parameter is applied to any Parser implementation.
  Set to -1 to deactivate, bearing in mind that this could cause
  the parsing to crash because of a very long or corrupted document.
  </description>
</property>

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Wednesday, October 19, 2011 11:28 AM
To: user@nutch.apache.org
Subject: Re: Good workaround for timeout?

It is indeed. Tricky.

Are you going through some proxy? Are you using protocol-http or httpclient? 
Are you sure the http.time.out value is actually used in lib-http?

 If I'm reading the log correctly, it's the fetch:
 
 2011-10-19 11:18:11,405 INFO  fetcher.Fetcher - fetch of 
 http://digital.lib.washington.edu/findingaids/view?docId=UA37_06_2932D
 onal dsonLauren.xml failed with: java.net.SocketTimeoutException: Read 
 timed out
 
 
 -Original Message-
 From: Markus Jelsma [mailto:markus.jel...@openindex.io]
 Sent: Wednesday, October 19, 2011 11:08 AM
 To: user@nutch.apache.org
 Subject: Re: Good workaround for timeout?
 
 What is timing out, the fetch or the parse?
 
  I'm getting a fairly persistent  timeout on a particular page. 
  Other, smaller pages in this folder do fine, but this one times out 
  most of the time. When it fails, my ParserChecker results look like:
  
  # bin/nutch org.apache.nutch.parse.ParserChecker -dumpText 
  http://digital.lib.washington.edu/findingaids/view?docId=UA37_06_293
  2D onal dsonLauren.xml Exception in thread main
  java.lang.NullPointerException
  
  at
  
  org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:84)
  
  I've stuck with the default value of 10 in my nutch-default.xml's 
  fetcher.threads.fetch value, and I've added the following to
  nutch-site.xml:
  
  (...)

Is there a workaround for https?

2011-10-19 Thread Chip Calhoun
I've noticed the recent posts about trouble with protocol-httpclient, which to 
my understanding is needed for https URLs. Is there another way to handle 
these? ParserChecker gives me the following when I try one of these URLs. 
Thanks.

# bin/nutch org.apache.nutch.parse.ParserChecker -dumpText 
https://libwebspace.library.cmu.edu:4430/Research/Archives/ead/generated/shull.xml
Exception in thread main org.apache.nutch.protocol.ProtocolNotFound: protocol 
not found for url=https
at 
org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:80)
at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:78)