If I were you, I would use Luke ( http://code.google.com/p/luke/ ) to examine what data you have in your indexes, if you're using Lucene indexes :)

On Tue, Aug 24, 2010 at 6:21 PM, Henry Noerdlinger <[email protected]> wrote:
> Thank you for response.
>
> I ran a simple test where I constructed a QueryParams object and have field /
> value of "url" and "http://blahblah.com/"
> and then added this to a Query object and passed this to my beloved NutchBean
> to search for like this:
>
>     String urlVal = "http://domain.com/webapp/content.do";
>     QueryParams qp = new QueryParams();
>     qp.put("url", urlVal);
>     Configuration conf = NutchConfiguration.create();
>     NutchBean bean = new NutchBean(conf);
>     Query query = new Query(conf);
>     query.setParams(qp);
>     Hits hits = bean.search(query);
>
> Didn't get anything.
>
> Is there someone who can give me a quick example of how this could be done?
>
> ________________________________________
> From: CatOs Mandros [[email protected]]
> Sent: Tuesday, August 24, 2010 4:10 AM
> To: [email protected]
> Subject: Re: find segment for an url
>
> Hi Henry,
>
> If i'm not mistaken, the correct way to handle this is to query your
> index . It should have the information about what segment is the URL
> located. Then you should only have to run your code on the segment
> returned to get the content.
>
> On Tue, Aug 24, 2010 at 12:24 AM, Henry Noerdlinger
> <[email protected]> wrote:
>> I want to loop through URLs which have been crawled / indexed.
>>
>> I have a (known) subset of URLs that I want to get the (raw) content for
>>
>> if I know the segment, I can do something like this:
>>
>>     String segName = "20100817162607";
>>     String url = "http://adomain.com/awebappOfInterest/someContent.do";
>>
>>     HitDetails detail = new HitDetails(segName, url);
>>     Configuration conf = NutchConfiguration.create();
>>
>>     NutchBean bean = new NutchBean(conf);
>>
>>     byte[] contentBytes = bean.getContent(detail);
>>     for (byte b : contentBytes)
>>     {
>>         System.out.print((char)b);
>>     }
>>
>> My question is, given, a known Url, how can I find what segment it is in? Is
>> there something in the API for giving an URL and getting back the name of
>> the segment it is found in?
>>
>> regards,
>> -henry
>> [email protected]
>>
>> InfoNow Corporation | This communication, including attachments, is for
>> the exclusive use of addressee and may contain proprietary, confidential or
>> privileged information.

