One trick would be to search for the URL; the explain link on the result
shows which segment it belongs to, say 1200604211450.
Then use the segread command (this works for 0.7.2):
bin/nutch segread -dumpsort -nocontent segments/1200604211450
That shows the text and parse data for a URL.
Thanks
P
-----Original Message-----
NutchBean.getContent() and NutchBean.getParseData() do this, but require
a HitDetails instance. In the non-distributed case, the only required
field of the HitDetails for these calls is url. In the distributed
case, the segment field must also be provided, so that the request can
be routed
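
The field requirement described above can be sketched in isolation. This is
a stand-alone toy model, not Nutch code: the class HitDetailsSketch and the
methods requiredFields and missingFields are hypothetical names invented for
illustration, and a plain Map stands in for a HitDetails instance. Only the
field names "url" and "segment" come from the description above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model (assumption): a Map plays the role of a HitDetails instance.
class HitDetailsSketch {
    // Fields a lookup needs: url alone when running locally,
    // url plus segment when the request has to be routed.
    static List<String> requiredFields(boolean distributed) {
        return distributed ? List.of("url", "segment") : List.of("url");
    }

    // Returns the required field names that are absent from the details.
    static List<String> missingFields(Map<String, String> details, boolean distributed) {
        List<String> missing = new ArrayList<>();
        for (String field : requiredFields(distributed)) {
            if (!details.containsKey(field)) {
                missing.add(field);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Map<String, String> details = new HashMap<>();
        details.put("url", "http://example.com/");
        System.out.println(missingFields(details, false)); // prints []
        System.out.println(missingFields(details, true));  // prints [segment]
        details.put("segment", "1200604211450");
        System.out.println(missingFields(details, true));  // prints []
    }
}
```

In a real setup the segment name would come from the search result (for
example via the explain link mentioned earlier) rather than being typed in.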
That got me started. I think that I am not fully understanding the role
the segments directory and its contents play. It looks like it holds
parse text and parse data in map files, but what is the content folder
(also a map file)? And are the segment contents used once the index is
created?
Truly I am just not understanding the concept of a segment.
Dennis Kubes wrote:
Can somebody direct me on how to get the stored text and parse
metadata for a given url?
From a single segment, or from a set of segments?
From a single segment: please see how SegmentReader.get() does this
(although it's a bit obscured by the fact that it uses multiple
Dennis Kubes wrote:
I think that I am not fully understanding the role
the segments directory and its contents play.
A segment is simply a set of URLs fetched in the same round, and the data
associated with these URLs. The content subdirectory contains the raw
HTTP content. The parse-text