RE: How to get Text and Parse data for URL

2006-04-26 Thread Prashant Purkar
One trick is to search on the URL: the explain link shows which segment it belongs to, say 1200604211450. Then use the segread command (this works for 0.7.2):
    bin/nutch segread -dumpsort -nocontent segments/1200604211450
That shows the text and parse data for a URL. Thanks, P
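As a rough sketch, the segread invocation from this message could be wrapped in a small shell function so the segment path is passed as an argument. The function name `dump_segment` and the `DRY_RUN` switch are made up for this illustration and are not part of Nutch; only the `bin/nutch segread -dumpsort -nocontent` command line comes from the message, and it is reported there as working for 0.7.2.

```shell
#!/bin/sh
# Sketch only: wraps the segread command quoted in the message (Nutch 0.7.2).
# dump_segment and DRY_RUN are hypothetical names for this example, not Nutch features.

dump_segment() {
    segment="$1"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        # Really run segread; assumes the current directory is the Nutch install.
        bin/nutch segread -dumpsort -nocontent "$segment"
    else
        # Default: only print the command that would be run.
        echo "bin/nutch segread -dumpsort -nocontent $segment"
    fi
}

# Example: dump_segment segments/1200604211450
```

With `DRY_RUN` left unset the function only prints the command, which makes it easy to check the segment path before running anything against a real crawl directory.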

Re: How to get Text and Parse data for URL

2006-04-25 Thread Doug Cutting
NutchBean.getContent() and NutchBean.getParseData() do this, but require a HitDetails instance. In the non-distributed case, the only required field of the HitDetails for these calls is url. In the distributed case, the segment field must also be provided, so that the request can be routed

Re: How to get Text and Parse data for URL

2006-04-25 Thread Dennis Kubes
That got me started. I think that I am not fully understanding the role the segments directory and its contents play. It looks like it holds parse text and parse data in map files, but what is the content folder (also a map file)? And are the segment contents used once the index is created?

Re: How to get Text and Parse data for URL

2006-04-25 Thread Dennis Kubes
Truly I am just not understanding the concept of a segment. Dennis Kubes wrote: That got me started. I think that I am not fully understanding the role the segments directory and its contents play. It looks like it holds parse text and parse data in map files, but what is the content folder

Re: How to get Text and Parse data for URL

2006-04-25 Thread Andrzej Bialecki
Dennis Kubes wrote: Can somebody direct me on how to get the stored text and parse metadata for a given url? From a single segment, or from a set of segments? From a single segment: please see how SegmentReader.get() does this (although it's a bit obscured by the fact that it uses multiple

Re: How to get Text and Parse data for URL

2006-04-25 Thread Doug Cutting
Dennis Kubes wrote: I think that I am not fully understanding the role the segments directory and its contents play. A segment is simply a set of URLs fetched in the same round, and the data associated with these URLs. The content subdirectory contains the raw HTTP content. The parse-text
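To see what a given segment actually holds, one option is simply to list its subdirectories. The sketch below assumes a segment is a plain directory on the local filesystem; the function name `list_segment_parts` is made up for this example, and the on-disk names mentioned in the comment (`content`, `parse_text`, `parse_data`) are an assumption based on the parts named in this thread, so a real segment may contain other subdirectories as well.

```shell
#!/bin/sh
# Sketch only: print the names of the subdirectories inside one segment.
# list_segment_parts is a hypothetical helper, not a Nutch command.
# Assumed typical output includes: content, parse_data, parse_text.

list_segment_parts() {
    for part in "$1"/*/; do
        # Skip the unexpanded glob when the segment has no subdirectories.
        if [ -d "$part" ]; then
            basename "$part"
        fi
    done
}

# Example: list_segment_parts segments/1200604211450
```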