RE: How to get Text and Parse data for URL

2006-04-26 Thread Prashant Purkar
One trick is to search for the URL; the "explain" link on the hit shows
which segment it belongs to, say 1200604211450.

Then use the segread command (this works for 0.7.2):

bin/nutch segread -dumpsort -nocontent segments/1200604211450

That shows the text and parse data for the URL.

Thanks
P




-----Original Message-----
From: Dennis Kubes [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, April 26, 2006 1:42 AM
To: nutch-user@lucene.apache.org
Subject: How to get Text and Parse data for URL

Can somebody direct me on how to get the stored text and parse metadata 
for a given url?

Dennis


Re: How to get Text and Parse data for URL

2006-04-25 Thread Doug Cutting
NutchBean.getContent() and NutchBean.getParseData() do this, but require 
a HitDetails instance.  In the non-distributed case, the only required 
field of the HitDetails for these calls is url.  In the distributed 
case, the segment field must also be provided, so that the request can 
be routed to a node serving that segment.  These are implemented by 
FetchedSegments.java and DistributedSearch.java.
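
The routing rule above can be illustrated with a small, self-contained model. This is only a sketch and not Nutch code: SegmentRouter, Details, and the node names are hypothetical stand-ins for the search client and HitDetails, chosen to show why the segment field becomes mandatory once segments are served by different nodes.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: hypothetical names, not the Nutch API.
public class SegmentRouter {

    // Minimal stand-in for HitDetails: a url plus an optional segment name.
    public static final class Details {
        final String url;
        final String segment; // may be null in the non-distributed case

        public Details(String segment, String url) {
            this.segment = segment;
            this.url = url;
        }
    }

    // segment name -> node that serves it (only used when distributed)
    private final Map<String, String> segmentToNode = new HashMap<>();
    private final boolean distributed;

    public SegmentRouter(boolean distributed) {
        this.distributed = distributed;
    }

    public void serve(String segment, String node) {
        segmentToNode.put(segment, node);
    }

    // Returns the node that should answer a content or parse-data request.
    public String route(Details d) {
        if (d.url == null) {
            throw new IllegalArgumentException("url is required");
        }
        if (!distributed) {
            return "local"; // a single process holds every segment
        }
        if (d.segment == null) {
            throw new IllegalArgumentException("segment is required when distributed");
        }
        String node = segmentToNode.get(d.segment);
        if (node == null) {
            throw new IllegalStateException("no node serves segment " + d.segment);
        }
        return node;
    }

    public static void main(String[] args) {
        SegmentRouter router = new SegmentRouter(true);
        router.serve("1200604211450", "node-a");
        // prints "node-a"
        System.out.println(router.route(new Details("1200604211450", "http://example.com/")));
    }
}
```

In the non-distributed case the same call succeeds with a null segment, which mirrors the point that only the url is needed there.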


Doug

Dennis Kubes wrote:
Can somebody direct me on how to get the stored text and parse metadata 
for a given url?


Dennis


Re: How to get Text and Parse data for URL

2006-04-25 Thread Dennis Kubes
That got me started.  I think that I am not fully understanding the role 
the segments directory and its contents play.  It looks like it holds 
parse text and parse data in map files, but what is the content folder 
(also a map file)?  And are the segment contents used once the index is 
created?


Dennis Kubes




Re: How to get Text and Parse data for URL

2006-04-25 Thread Dennis Kubes

Truly I am just not understanding the concept of a segment.



Re: How to get Text and Parse data for URL

2006-04-25 Thread Andrzej Bialecki

Dennis Kubes wrote:
Can somebody direct me on how to get the stored text and parse 
metadata for a given url?


From a single segment, or from a set of segments?

From a single segment: please see how SegmentReader.get() does this 
(although it's a bit obscured by the fact that it uses multiple threads 
to retrieve different parts of the data).


For multiple segments, it helps to know in advance which segment holds 
the data associated with the URL; that is normally what the Lucene index 
is for ;) - please see FetchedSegments for details.
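
The difference between the single-segment and multi-segment cases can be sketched with plain maps. This is a toy model under stated assumptions, not how SegmentReader or FetchedSegments are implemented: SegmentLookup and its methods are hypothetical, one map stands in for the segments and another for the index.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Toy model only: plain maps stand in for segments and for the Lucene index.
public class SegmentLookup {

    // segment name -> (url -> parse text)
    private final Map<String, Map<String, String>> segments = new HashMap<>();
    // url -> segment name, playing the role of the index
    private final Map<String, String> index = new HashMap<>();

    public void add(String segment, String url, String parseText) {
        segments.computeIfAbsent(segment, k -> new HashMap<>()).put(url, parseText);
        // If a url appears in several segments, the most recently added one wins here.
        index.put(url, segment);
    }

    // Single segment: a direct keyed lookup inside that segment.
    public Optional<String> getFromSegment(String segment, String url) {
        Map<String, String> data = segments.get(segment);
        return data == null ? Optional.empty() : Optional.ofNullable(data.get(url));
    }

    // Multiple segments: ask the index which segment holds the url, then read from it.
    public Optional<String> get(String url) {
        String segment = index.get(url);
        return segment == null ? Optional.empty() : getFromSegment(segment, url);
    }

    public static void main(String[] args) {
        SegmentLookup lookup = new SegmentLookup();
        lookup.add("segment-a", "http://example.com/", "example text");
        lookup.add("segment-b", "http://example.org/", "other text");
        System.out.println(lookup.get("http://example.org/").orElse("not found"));
    }
}
```

Without the url-to-segment map, the only option would be to probe every segment in turn, which is the cost the index avoids.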


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: How to get Text and Parse data for URL

2006-04-25 Thread Doug Cutting

Dennis Kubes wrote:
I think that I am not fully understanding the role 
the segments directory and its contents play.


A segment is simply a set of urls fetched in the same round, and data 
associated with these urls.  The content subdirectory contains the raw 
HTTP content.  The parse-text subdirectory contains the extracted text, 
used when indexing and when building snippets for hits.  The index 
subdirectory holds a Lucene index of the pages in the segment.  Etc.  
Each segment is an independent chunk of Nutch data.


In 0.8, each segment subdirectory is further split into parts, the 
result of distributed processing.  The parts are split by the hash of 
the url.
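
Splitting by the hash of the url can be sketched as follows. This is a minimal illustration, assuming a common hash-partitioning formula and Java's String.hashCode(); the key type and hash function that Nutch 0.8 actually uses may differ, and UrlPartitioner and the part naming are hypothetical.

```java
import java.util.Locale;

// Sketch only: shows hash partitioning in general, not the exact Nutch code path.
public class UrlPartitioner {

    // Clear the sign bit so the result is non-negative, then take the remainder.
    public static int partFor(String url, int numParts) {
        if (numParts <= 0) {
            throw new IllegalArgumentException("numParts must be positive");
        }
        return (url.hashCode() & Integer.MAX_VALUE) % numParts;
    }

    // Directory-style name for a part (the naming scheme is an assumption).
    public static String partName(int part) {
        return String.format(Locale.ROOT, "part-%05d", part);
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/",
            "http://example.com/about",
            "http://example.org/"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + partName(partFor(url, 4)));
        }
    }
}
```

Because the part is a pure function of the url, a lookup for a given url only has to open one part of each subdirectory, provided the reader uses the same hash and part count as the writer.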


Does that help?

Doug