RE: How to get Text and Parse data for URL
One trick is to search for the URL; the explain link on the hit shows which segment it belongs to, say 1200604211450. Then use the segread command (this works for 0.7.2):

bin/nutch segread -dumpsort -nocontent segments/1200604211450

That shows the text and parse data for the URL.

Thanks
P

-----Original Message-----
From: Dennis Kubes [mailto:[EMAIL PROTECTED]
Sent: Wednesday, April 26, 2006 1:42 AM
To: nutch-user@lucene.apache.org
Subject: How to get Text and Parse data for URL

Can somebody direct me on how to get the stored text and parse metadata for a given url?

Dennis
Re: How to get Text and Parse data for URL
NutchBean.getContent() and NutchBean.getParseData() do this, but require a HitDetails instance. In the non-distributed case, the only required field of the HitDetails for these calls is url. In the distributed case, the segment field must also be provided, so that the request can be routed to a node serving that segment.

These are implemented by FetchedSegments.java and DistributedSearch.java.

Doug
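The difference between the two cases described above can be pictured with a toy model. This is a minimal sketch and not Nutch code: the class names SegmentLookupSketch, Details and Node are hypothetical, and the way the local case finds the right segment (a scan over all local segments) is an assumption made only for illustration. The only point it tries to show is that a url alone can be enough when all segments are local, while a distributed lookup needs the segment name to pick a node.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Toy model of looking up stored text from hit details; not Nutch code. */
final class SegmentLookupSketch {

    /** Hypothetical stand-in for the url and segment fields of a hit. */
    static final class Details {
        final String url;
        final String segment; // may be null in the non-distributed case

        Details(String url, String segment) {
            this.url = url;
            this.segment = segment;
        }
    }

    /** One node serving some segments: segment name -> (url -> text). */
    static final class Node {
        private final Map<String, Map<String, String>> segments = new HashMap<>();

        void put(String segment, String url, String text) {
            segments.computeIfAbsent(segment, k -> new HashMap<>()).put(url, text);
        }

        /** Local lookup: uses the segment if given, otherwise scans (assumption). */
        Optional<String> getText(Details d) {
            if (d.segment != null) {
                Map<String, String> seg = segments.get(d.segment);
                return seg == null ? Optional.empty() : Optional.ofNullable(seg.get(d.url));
            }
            for (Map<String, String> seg : segments.values()) {
                String text = seg.get(d.url);
                if (text != null) {
                    return Optional.of(text);
                }
            }
            return Optional.empty();
        }
    }

    /** Distributed lookup: the segment name selects which node to ask. */
    static Optional<String> getTextDistributed(Map<String, Node> nodeBySegment, Details d) {
        if (d.segment == null) {
            throw new IllegalArgumentException("segment is required to route the request");
        }
        Node node = nodeBySegment.get(d.segment);
        return node == null ? Optional.empty() : node.getText(d);
    }
}
```

In this sketch a Details with a null segment still works against a single Node, but getTextDistributed rejects it, because without the segment name there is nothing to route on.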
Re: How to get Text and Parse data for URL
That got me started. I think that I am not fully understanding the role the segments directory and its contents play. It looks like it holds parse text and parse data in map files, but what is the content folder (also a map file)? And are the segment contents still used once the index is created?

Dennis Kubes
Re: How to get Text and Parse data for URL
Truly, I am just not understanding the concept of a segment.
Re: How to get Text and Parse data for URL
Dennis Kubes wrote:
Can somebody direct me on how to get the stored text and parse metadata for a given url?

From a single segment, or from a set of segments?

From a single segment: please see how SegmentReader.get() does this (although it's a bit obscured by the fact that it uses multiple threads to retrieve different parts of the data).

For multiple segments, it would help if you knew in advance which segment holds the data associated with the URL; that's what the Lucene index is normally for ;) Please see FetchedSegments for details.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: How to get Text and Parse data for URL
Dennis Kubes wrote:
I think that I am not fully understanding the role the segments directory and its contents play.

A segment is simply a set of urls fetched in the same round, and the data associated with these urls. The content subdirectory contains the raw http content. The parse-text subdirectory contains the extracted text, used when indexing and when building snippets for hits. The index subdirectory holds a Lucene index of the pages in the segment. Etc. It is an independent chunk of Nutch data.

In 0.8, each segment subdirectory is further split into parts, the result of distributed processing. The parts are split by the hash of the url.

Does that help?

Doug
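The "split by the hash of the url" step can be pictured as follows. This is a minimal sketch assuming the common hash-modulo scheme. The thread does not say which hash function Nutch 0.8 actually uses, so String.hashCode() is an assumption, and the class name UrlPartSketch, its methods, and the part-NNNNN directory naming are illustrative only.

```java
import java.util.Locale;

/** Toy illustration of assigning urls to parts by hash; not Nutch code. */
final class UrlPartSketch {

    /** Maps a url to a part number in the range [0, numParts). */
    static int partFor(String url, int numParts) {
        if (numParts <= 0) {
            throw new IllegalArgumentException("numParts must be positive");
        }
        // Clear the sign bit so a negative hashCode still yields a valid index.
        return (url.hashCode() & Integer.MAX_VALUE) % numParts;
    }

    /** Hypothetical part directory name; the zero-padded format is an assumption. */
    static String partName(String url, int numParts) {
        return String.format(Locale.ROOT, "part-%05d", partFor(url, numParts));
    }
}
```

Because the part number depends only on the url and the number of parts, the same url always lands in the same part for a given split, so a reader that knows the url can go straight to one part instead of opening all of them.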