Re: Extending HTML Parser to create subpage index documents
malcolm smith wrote:
I am looking to create a parser for a groupware product that would read pages from a message board type web site (think phpBB). But rather than creating a single Content item which is parsed and indexed to a single Lucene document, I am planning to have the parser create a master document (for the original post) and an additional document for each reply item. I've reviewed the code for protocol plugins, parser plugins and indexing plugins, but each interface allows for only a single document or content object to be passed around. Am I missing something simple?

My best bet at the moment is to implement some kind of new fake protocol for the reply items: I would use the http client plugin for the first request to the page and generate outlinks like fakereplyto://originalurl/reply1, fakereplyto://originalurl/reply2 to go back through and fetch the sub-page content. But this seems round-about and would probably generate an http request for each reply on the original page. Then again, perhaps there is a way to look up the original page in the segment db before requesting it again.

Needless to say, it would seem more straightforward to tackle this in some kind of parser plugin that could break the original page into pieces that are treated as standalone pages for indexing purposes. Last but not least, conceptually a plugin for the indexer might be able to take a set of custom metadata for a replies collection and index it as separate Lucene documents - but I can't find a way to do this given the interfaces in the indexer plugins.

Thanks in advance
Malcolm Smith

What version of Nutch are you using? This should already be possible using the 1.0 release or a nightly build. ParseResult (which is what parsers produce) can hold multiple Parse objects, each with its own URL. The common approach to handling whole-part relationships (like zip/tar archives, RSS, and other compound docs) is to split them in the parser and parse each part, then give each sub-document its own URL (e.g. file.tar!myfile.txt) and add the original URL to the metadata, to keep track of the parent URL. The rest should be handled automatically, although there are some other complications that need to be handled as well (e.g. don't recrawl sub-documents).

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web | Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
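For reference, a minimal sketch of what such a parser plugin could look like against the Nutch 1.0 parse API. The class name ForumPageParser, the splitIntoPosts() helper, the "!reply-N" URL suffix and the "parent.url" metadata key are all made up for illustration; the ParseResult/ParseImpl/ParseData calls are the usual ones, but check the exact signatures against your Nutch version.

package org.example.parse.forum;

import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseImpl;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.ParseStatus;
import org.apache.nutch.parse.ParseText;
import org.apache.nutch.parse.Parser;
import org.apache.nutch.protocol.Content;

public class ForumPageParser implements Parser {

  private Configuration conf;

  public ParseResult getParse(Content content) {
    String url = content.getUrl();
    String html = new String(content.getContent());

    // Hypothetical helper: element 0 is the original post, the rest are replies.
    List<String> posts = splitIntoPosts(html);

    // The original post stays under the page's own URL.
    ParseData masterData = new ParseData(ParseStatus.STATUS_SUCCESS,
        "Original post", new Outlink[0], content.getMetadata(), new Metadata());
    ParseResult result =
        ParseResult.createParseResult(url, new ParseImpl(posts.get(0), masterData));

    // Each reply becomes its own Parse under a synthetic sub-URL, with the
    // parent URL recorded in the parse metadata so it can be traced back.
    for (int i = 1; i < posts.size(); i++) {
      String subUrl = url + "!reply-" + i;           // illustrative naming scheme
      Metadata parseMeta = new Metadata();
      parseMeta.add("parent.url", url);              // illustrative metadata key
      ParseData replyData = new ParseData(ParseStatus.STATUS_SUCCESS,
          "Reply " + i, new Outlink[0], content.getMetadata(), parseMeta);
      result.put(subUrl, new ParseText(posts.get(i)), replyData);
    }
    return result;
  }

  // Placeholder splitter; a real plugin would walk the DOM and cut on post boundaries.
  private List<String> splitIntoPosts(String html) {
    return Arrays.asList(html.split("<hr/>"));
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}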
Nutch crawler charset issues utf-16
I'm attempting to crawl pages with charset utf-16 and send the index to Solr where it can be searched. I followed the instructions here: http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ and successfully crawled and searched test content with utf-8. However, when I attempt to crawl the utf-16 content, it gets sent to Solr as Japanese characters. The pages encoded as utf-16 contain only English text, no special characters. Is there any way to force Nutch to crawl the page as utf-8 and ignore the utf-16 setting? Thanks.
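For reference, the charset the HTML parser falls back to when it cannot detect one is configurable via parser.character.encoding.default (defined in nutch-default.xml, overridable in nutch-site.xml). This is only a sketch of that knob; it will not necessarily override an explicit utf-16 declaration in the page or in the HTTP headers.

    <property>
      <name>parser.character.encoding.default</name>
      <value>utf-8</value>
    </property>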
crawl always stops at depth=3
My crawl always stops at depth=3. It gets documents but does not continue any further. Here is my nutch-site.xml:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutch-solr-integration</value>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>1000</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-(crawl|regex)|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>1000</value>
  </property>
</configuration>
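For reference, if the crawl is driven by the one-shot crawl tool, the link depth comes from the -depth flag rather than from nutch-site.xml; a sketch with placeholder paths and values (defaults apply when the flags are omitted):

    bin/nutch crawl urls -dir crawl -threads 10 -depth 10 -topN 1000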
Re: crawl always stops at depth=3
Try:

    bin/nutch readdb crawl/crawldb -stats

Are there any unfetched pages?

nutchcase schrieb:
> My crawl always stops at depth=3. It gets documents but does not continue any further.
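Beyond -stats, the CrawlDb reader can also dump the individual entries so you can see which URLs are still unfetched; a sketch, with the crawldb path and dump directory as placeholders:

    bin/nutch readdb crawl/crawldb -stats
    bin/nutch readdb crawl/crawldb -dump crawldb-dump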
ERROR: current leaseholder is trying to recreate file.
This is the error I keep getting whenever I try to fetch more than 400K files at a time using a 4-node Hadoop cluster running Nutch 1.0:

org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /user/hadoop/crawl/segments/20091013161641/crawl_fetch/part-00015/index for DFSClient_attempt_200910131302_0011_r_15_2 on client 192.168.1.201 because current leaseholder is trying to recreate file.

Can anybody shed some light on this issue? I was under the impression that 400K was small potatoes for a Nutch/Hadoop combo?

Thanks,
Eric Osgood
Cal Poly - Computer Engineering
Moon Valley Software
eosg...@calpoly.edu, e...@lakemeadonline.com
www.calpoly.edu/~eosgood, www.lakemeadonline.com
Re: ERROR: current leaseholder is trying to recreate file.
Eric Osgood wrote:
> This is the error I keep getting whenever I try to fetch more than 400K files at a time using a 4-node Hadoop cluster running Nutch 1.0:
> org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /user/hadoop/crawl/segments/20091013161641/crawl_fetch/part-00015/index for DFSClient_attempt_200910131302_0011_r_15_2 on client 192.168.1.201 because current leaseholder is trying to recreate file.

Please see this issue: https://issues.apache.org/jira/browse/NUTCH-692

Apply the patch that is attached there, rebuild Nutch, and tell me if this fixes your problem. (The patch will be applied to trunk anyway, since others confirmed that it fixes this issue.)

> Can anybody shed some light on this issue? I was under the impression that 400K was small potatoes for a Nutch/Hadoop combo?

It is. This problem is rare - I think I have crawled cumulatively ~500mln pages in various configs and it has never happened to me personally. It requires a few things to go wrong (see the issue comments).

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web | Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
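A minimal sketch of applying the patch from a Nutch 1.0 source checkout and rebuilding; the patch file name and the -p level are assumptions that depend on how the attachment on the JIRA issue was generated:

    cd nutch-1.0
    patch -p0 < NUTCH-692.patch
    ant clean job
    # the rebuilt nutch-*.job under build/ is what bin/nutch submits to the Hadoop cluster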
Re: ERROR: current leaseholder is trying to recreate file.
Andrzej,

I just downloaded the most recent trunk from svn as per your recommendations for fixing the generate bug. As soon as I have it all rebuilt with my configs I will let you know how a crawl of ~1.6mln pages goes. Hopefully no errors!

Thanks,
Eric

On Oct 20, 2009, at 2:13 PM, Andrzej Bialecki wrote:
> Please see this issue: https://issues.apache.org/jira/browse/NUTCH-692
> Apply the patch that is attached there, rebuild Nutch, and tell me if this fixes your problem.

Eric Osgood
Cal Poly - Computer Engineering
Moon Valley Software
eosg...@calpoly.edu, e...@lakemeadonline.com
www.calpoly.edu/~eosgood, www.lakemeadonline.com
Re: Extending HTML Parser to create subpage index documents
Thank you very much for the helpful reply, I'm back on track.

On Tue, Oct 20, 2009 at 2:01 AM, Andrzej Bialecki <a...@getopt.org> wrote:
> What version of Nutch are you using? This should already be possible using the 1.0 release or a nightly build. ParseResult (which is what parsers produce) can hold multiple Parse objects, each with its own URL. The common approach to handling whole-part relationships (like zip/tar archives, RSS, and other compound docs) is to split them in the parser and parse each part, then give each sub-document its own URL (e.g. file.tar!myfile.txt) and add the original URL to the metadata, to keep track of the parent URL.