Re: Extending HTML Parser to create subpage index documents

2009-10-20 Thread Andrzej Bialecki

malcolm smith wrote:

I am looking to create a parser for a groupware product that would read
pages from a message board type web site (think phpBB). But rather than creating
a single Content item which is parsed and indexed to a single Lucene
document, I am planning to have the parser create a master document (for the
original post) and an additional document for each reply item.

I've reviewed the code for protocol plugins, parser plugins and indexing
plugins, but each interface only allows a single document or content object to
be passed around.

Am I missing something simple?

My best bet at the moment is to implement some kind of new fake protocol
for the reply items: I would use the http client plugin for the first
request to the page and generate outlinks like
fakereplyto://originalurl/reply1 and fakereplyto://originalurl/reply2 to go
back through and fetch the sub-page content.  But this seems round-about and
would probably generate an http request for each reply on the original
page.  Perhaps there is a way to look up the original page in the segment
db before requesting it again.

Needless to say, it would seem more straightforward to tackle this in some
kind of parser plugin that could break the original page into pieces that
are treated as standalone pages for indexing purposes.

Last but not least, conceptually an indexer plugin might be able to
take a set of custom metadata for a replies collection and index it as
separate Lucene documents - but I can't find a way to do this given the
interfaces in the indexer plugins.

Thanks in advance
Malcolm Smith


What version of Nutch are you using? This should already be possible to
do using the 1.0 release or a nightly build. ParseResult (which is what
parsers produce) can hold multiple Parse objects, each with its own URL.


The common approach to handling whole-part relationships (like zip/tar
archives, RSS, and other compound docs) is to split them in the parser,
parse each part, give each sub-document its own URL (e.g.
file.tar!myfile.txt), and add the original URL to the metadata to keep
track of the parent URL. The rest should be handled automatically,
although there are some other complications that need to be handled as
well (e.g. don't re-crawl sub-documents).
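
To make this concrete, here is a rough sketch of such a parser plugin
against the 1.0 parsing API. ParseResult, ParseData, ParseText, ParseStatus,
Metadata and Content are the real classes; the splitPosts() helper, the
"!reply-N" sub-URL naming and the "parent.url" metadata key are just
placeholders that you would adapt to your message board:

import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.ParseStatus;
import org.apache.nutch.parse.ParseText;
import org.apache.nutch.parse.Parser;
import org.apache.nutch.protocol.Content;

// Sketch of a parser plugin that emits one Parse entry per forum post.
public class MessageBoardParser implements Parser {

  private Configuration conf;

  public ParseResult getParse(Content content) {
    String url = content.getUrl();
    // Simplified: real code should detect the page encoding.
    String html = new String(content.getContent(), Charset.forName("UTF-8"));

    // Placeholder: real code would use an HTML parser to extract
    // the original post and each reply from the message board markup.
    List<String> posts = splitPosts(html);

    ParseResult result = new ParseResult(url);
    for (int i = 0; i < posts.size(); i++) {
      // The first entry keeps the original URL, replies get derived URLs.
      String subUrl = (i == 0) ? url : url + "!reply-" + i;

      // Record the parent URL so sub-documents can be traced back.
      Metadata parseMeta = new Metadata();
      parseMeta.add("parent.url", url);

      ParseData data = new ParseData(ParseStatus.STATUS_SUCCESS,
          (i == 0) ? "Original post" : "Reply " + i,
          new Outlink[0], content.getMetadata(), parseMeta);
      result.put(subUrl, new ParseText(posts.get(i)), data);
    }
    return result;
  }

  // Stand-in for the real message-board-specific extraction logic.
  private List<String> splitPosts(String html) {
    List<String> posts = new ArrayList<String>();
    posts.add(html);
    return posts;
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}

You would still register the plugin in its plugin.xml descriptor and map
the content type to it in conf/parse-plugins.xml, as with any other parser
plugin.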




--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Nutch crawler charset issues utf-16

2009-10-20 Thread John_C_3

I'm attempting to crawl pages with charset UTF-16 and send the index to Solr
where it can be searched.  I followed the instructions at
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ and
successfully crawled and searched test content with UTF-8. However, when I
attempt to crawl the UTF-16 content it gets sent to Solr as Japanese
characters. The pages encoded as UTF-16 contain only English text, no
special characters. Is there any way to force Nutch to crawl the page as
UTF-8 and ignore the UTF-16 setting?

Thanks.



crawl always stops at depth=3

2009-10-20 Thread nutchcase

My crawl always stops at depth=3. It gets documents but does not continue any
further.
Here is my nutch-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutch-solr-integration</value>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>1000</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-(crawl|regex)|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>1000</value>
  </property>
</configuration>





Re: crawl always stops at depth=3

2009-10-20 Thread reinhard schwab
Try:

bin/nutch readdb crawl/crawldb -stats

Are there any unfetched pages?

nutchcase schrieb:
 My crawl always stops at depth=3. It gets documents but does not continue any
 further.
 Here is my nutch-site.xml
 <?xml version="1.0"?>
 <configuration>
   <property>
     <name>http.agent.name</name>
     <value>nutch-solr-integration</value>
   </property>
   <property>
     <name>generate.max.per.host</name>
     <value>1000</value>
   </property>
   <property>
     <name>plugin.includes</name>
     <value>protocol-http|urlfilter-(crawl|regex)|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
   </property>
   <property>
     <name>db.max.outlinks.per.page</name>
     <value>1000</value>
   </property>
 </configuration>


   



ERROR: current leaseholder is trying to recreate file.

2009-10-20 Thread Eric Osgood
This is the error I keep getting whenever I try to fetch more than
400K files at a time using a 4-node Hadoop cluster running Nutch 1.0.


org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed
to create file
/user/hadoop/crawl/segments/20091013161641/crawl_fetch/part-00015/index
for DFSClient_attempt_200910131302_0011_r_15_2 on client 192.168.1.201
because current leaseholder is trying to recreate file.


Can anybody shed some light on this issue? I was under the impression
that 400K was small potatoes for a Nutch/Hadoop combo?


Thanks,


Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood, www.lakemeadonline.com



Re: ERROR: current leaseholder is trying to recreate file.

2009-10-20 Thread Andrzej Bialecki

Eric Osgood wrote:
This is the error I keep getting whenever I try to fetch more than 400K
files at a time using a 4-node Hadoop cluster running Nutch 1.0.


org.apache.hadoop.ipc.RemoteException: 
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to 
create file 
/user/hadoop/crawl/segments/20091013161641/crawl_fetch/part-00015/index 
for DFSClient_attempt_200910131302_0011_r_15_2 on client 
192.168.1.201 because current leaseholder is trying to recreate file.


Please see this issue:

https://issues.apache.org/jira/browse/NUTCH-692

Apply the patch that is attached there, rebuild Nutch, and tell me if 
this fixes your problem.


(The patch will be applied to trunk anyway, since others have confirmed that
it fixes this issue.)




Can anybody shed some light on this issue? I was under the impression
that 400K was small potatoes for a Nutch/Hadoop combo?


It is. This problem is rare - I think I have crawled cumulatively ~500 mln
pages in various configs and it has never happened to me personally. It
requires a few things to go wrong (see the issue comments).



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: ERROR: current leaseholder is trying to recreate file.

2009-10-20 Thread Eric Osgood

Andrzej,

I just downloaded the most recent trunk from svn as per your
recommendations for fixing the generate bug. As soon as I have it all
rebuilt with my configs I will let you know how a crawl of ~1.6 mln
pages goes. Hopefully no errors!


Thanks,

Eric

On Oct 20, 2009, at 2:13 PM, Andrzej Bialecki wrote:


Eric Osgood wrote:
This is the error I keep getting whenever I try to fetch more than
400K files at a time using a 4-node Hadoop cluster running Nutch 1.0.
org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
failed to create file
/user/hadoop/crawl/segments/20091013161641/crawl_fetch/part-00015/index
for DFSClient_attempt_200910131302_0011_r_15_2 on client
192.168.1.201 because current leaseholder is trying to recreate file.


Please see this issue:

https://issues.apache.org/jira/browse/NUTCH-692

Apply the patch that is attached there, rebuild Nutch, and tell me  
if this fixes your problem.


(The patch will be applied to trunk anyway, since others have confirmed
that it fixes this issue.)


Can anybody shed some light on this issue? I was under the
impression that 400K was small potatoes for a Nutch/Hadoop combo?


It is. This problem is rare - I think I have crawled cumulatively ~500 mln
pages in various configs and it has never happened to me personally. It
requires a few things to go wrong (see the issue comments).



--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood, www.lakemeadonline.com



Re: Extending HTML Parser to create subpage index documents

2009-10-20 Thread malcolm smith
Thank you very much for the helpful reply, I'm back on track.


On Tue, Oct 20, 2009 at 2:01 AM, Andrzej Bialecki a...@getopt.org wrote:

 malcolm smith wrote:

 I am looking to create a parser for a groupware product that would read
 pages from a message board type web site (think phpBB). But rather than
 creating a single Content item which is parsed and indexed to a single
 Lucene document, I am planning to have the parser create a master document
 (for the original post) and an additional document for each reply item.

 I've reviewed the code for protocol plugins, parser plugins and indexing
 plugins, but each interface only allows a single document or content object
 to be passed around.

 Am I missing something simple?

 My best bet at the moment is to implement some kind of new fake protocol
 for the reply items: I would use the http client plugin for the first
 request to the page and generate outlinks like
 fakereplyto://originalurl/reply1 and fakereplyto://originalurl/reply2 to go
 back through and fetch the sub-page content.  But this seems round-about
 and would probably generate an http request for each reply on the original
 page.  Perhaps there is a way to look up the original page in the segment
 db before requesting it again.

 Needless to say, it would seem more straightforward to tackle this in some
 kind of parser plugin that could break the original page into pieces that
 are treated as standalone pages for indexing purposes.

 Last but not least, conceptually an indexer plugin might be able to
 take a set of custom metadata for a replies collection and index it as
 separate Lucene documents - but I can't find a way to do this given the
 interfaces in the indexer plugins.

 Thanks in advance
 Malcolm Smith


 What version of Nutch are you using? This should already be possible to do
 using the 1.0 release or a nightly build. ParseResult (which is what parsers
 produce) can hold multiple Parse objects, each with its own URL.

 The common approach to handling whole-part relationships (like zip/tar
 archives, RSS, and other compound docs) is to split them in the parser,
 parse each part, give each sub-document its own URL (e.g.
 file.tar!myfile.txt), and add the original URL to the metadata to keep track
 of the parent URL. The rest should be handled automatically, although there
 are some other complications that need to be handled as well (e.g. don't
 re-crawl sub-documents).



 --
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com