Re: Extending HTML Parser to create subpage index documents

2009-10-20 Thread malcolm smith
Thank you very much for the helpful reply; I'm back on track.



Re: Extending HTML Parser to create subpage index documents

2009-10-19 Thread Andrzej Bialecki


What version of Nutch are you using? This should already be possible with
the 1.0 release or a nightly build. ParseResult (which is what parsers
produce) can hold multiple Parse objects, each with its own URL.
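
A rough, untested sketch of how that might look for HTML pages, using the
HtmlParseFilter extension point (which also receives and returns a
ParseResult). The class name, the "!reply-N" URL convention, the
"parent.url" metadata key and the extractReplies() helper are illustrative
placeholders, not existing Nutch names:

import java.util.Collections;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.ParseStatus;
import org.apache.nutch.parse.ParseText;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

// Illustrative parse filter: adds one extra Parse entry per reply found on
// a message-board page, so each reply becomes its own document at index time.
public class ReplySplitterFilter implements HtmlParseFilter {

  private Configuration conf;

  public ParseResult filter(Content content, ParseResult parseResult,
                            HTMLMetaTags metaTags, DocumentFragment doc) {
    String parentUrl = content.getUrl();
    List<String> replies = extractReplies(doc); // site-specific DOM walk (placeholder)

    int i = 0;
    for (String replyText : replies) {
      i++;
      // Give each reply its own URL, modelled on the file.tar!myfile.txt convention.
      String subUrl = parentUrl + "!reply-" + i;

      // Keep track of the parent URL in the parse metadata.
      Metadata parseMeta = new Metadata();
      parseMeta.add("parent.url", parentUrl);

      ParseData data = new ParseData(ParseStatus.STATUS_SUCCESS,
          "Reply " + i, new Outlink[0], content.getMetadata(), parseMeta);

      // Add the reply as an additional entry in the same ParseResult.
      parseResult.put(subUrl, new ParseText(replyText), data);
    }
    return parseResult;
  }

  // Placeholder: pull the text of each reply post out of the parsed DOM.
  private List<String> extractReplies(DocumentFragment doc) {
    return Collections.emptyList();
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}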


The common approach to handling whole-part relationships (zip/tar archives,
RSS, and other compound documents) is to split them in the parser and parse
each part, then give each sub-document its own URL (e.g. file.tar!myfile.txt)
and record the original URL in its metadata, so that the parent URL can be
tracked. The rest should be handled automatically, although there are some
other complications that need to be taken care of as well (e.g. making sure
sub-documents are not recrawled).
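
One untested way to cover the "don't recrawl sub-documents" part is a URL
filter plugin that rejects the synthetic sub-document URLs so they never make
it into the fetch list (the "!reply-" marker is only the placeholder
convention from the sketch above):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

// Illustrative URL filter: drops synthetic sub-document URLs produced by the
// parse filter above, and passes every other URL through unchanged.
public class SubDocumentUrlFilter implements URLFilter {

  private Configuration conf;

  public String filter(String urlString) {
    if (urlString != null && urlString.contains("!reply-")) {
      return null; // returning null tells Nutch to reject the URL
    }
    return urlString;
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}

Both classes would still need the usual plugin.xml descriptor and an entry in
plugin.includes before Nutch picks them up.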




--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Extending HTML Parser to create subpage index documents

2009-10-19 Thread malcolm smith
I am looking to create a parser for a groupware product that would read pages
from a message-board-style web site (think phpBB). But rather than creating
a single Content item which is parsed and indexed into a single Lucene
document, I am planning to have the parser create a master document (for the
original post) and an additional document for each reply.

I've reviewed the code for protocol plugins, parser plugins and indexing
plugins, but each interface only allows a single document or Content object
to be passed around.

Am I missing something simple?

My best bet at the moment is to implement some kind of new "fake" protocol
for the reply items: I would use the HTTP client plugin for the first request
to the page and generate outlinks such as "fakereplyto://originalurl/reply1"
and "fakereplyto://originalurl/reply2", then go back through and fetch the
sub-page content. But this seems roundabout and would probably generate an
HTTP request for each reply on the original page. Perhaps there is a way to
look up the original page in the segment db before requesting it again,
though.

Needless to say, it would seem more straightforward to tackle this in some
kind of parser plugin that could break the original page into pieces that
are treated as standalone pages for indexing purposes.

Last but not least, an indexing plugin might conceptually be able to take a
set of custom metadata for a "replies" collection and index it as separate
Lucene documents, but I can't find a way to do this given the interfaces
available to indexing plugins.

Thanks in advance
Malcolm Smith