Thank you,
That is what I had already begun to look into. At some point in my process, a
configuration will need to be set that defines a search identifying the
specific set of URLs to get content from.
I was planning on writing some code to sift through readlinkdb results.
My solution now looks like this:
// This term is a pattern that identifies a set of URLs.
String term = "partnerdetails";
String field = "url";
Configuration conf = NutchConfiguration.create();
NutchBean bean = new NutchBean(conf);
Query query = new Query(conf);
query.addRequiredTerm(term, field);
Hits hits = bean.search(query);
for (int i = 0; i < hits.getLength(); i++)
{
    Hit hit = hits.getHit(i);
    HitDetails detail = bean.getDetails(hit);
    byte[] contentBytes = bean.getContent(detail);
    // Decode the raw bytes in one step. UTF-8 is an assumption here;
    // the real charset depends on the fetched page.
    String content = new String(contentBytes, "UTF-8");
}
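
A note on turning the fetched bytes into text: a per-byte (char) cast is only
safe for 7-bit ASCII, because Java bytes are signed and anything at or above
0x80 is sign-extended into an unrelated character. Below is a minimal,
standalone sketch of the difference. The class name DecodeContent, its two
helper methods, and the choice of UTF-8 are assumptions for illustration only;
none of this is Nutch API, and the correct charset depends on the page that was
fetched.

```java
import java.nio.charset.Charset;

class DecodeContent {
    // Decode the whole buffer at once using a named charset.
    public static String decode(byte[] contentBytes, String charsetName) {
        return new String(contentBytes, Charset.forName(charsetName));
    }

    // The per-byte cast, kept only for comparison.
    public static String castEachByte(byte[] contentBytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : contentBytes) {
            sb.append((char) b); // bytes >= 0x80 are negative and sign-extend
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "café" in UTF-8: the final character is the two bytes 0xC3 0xA9.
        byte[] sample = { 'c', 'a', 'f', (byte) 0xC3, (byte) 0xA9 };

        String decoded = decode(sample, "UTF-8");
        String cast = castEachByte(sample);

        System.out.println(decoded.equals("caf\u00e9")); // true
        System.out.println(decoded.length());            // 4
        System.out.println(cast.length());               // 5
    }
}
```

For pure ASCII content both approaches give the same string, so the cast may
appear to work until a page with accented or non-Latin characters comes along.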
________________________________________
From: CatOs Mandros [[email protected]]
Sent: Tuesday, August 24, 2010 1:52 PM
To: [email protected]
Subject: Re: find segment for an url
If I were you, I would use Luke ( http://code.google.com/p/luke/ ) to
examine what data you have in your indexes, if you're using Lucene
indexes :)
On Tue, Aug 24, 2010 at 6:21 PM, Henry Noerdlinger
<[email protected]> wrote:
> Thank you for the response.
>
> I ran a simple test where I constructed a QueryParams object with a field /
> value pair of "url" and "http://blahblah.com/",
> then set it on a Query object and passed that to my beloved NutchBean
> to search, like this:
> String urlVal = "http://domain.com/webapp/content.do";
> QueryParams qp = new QueryParams();
> qp.put("url", urlVal);
> Configuration conf = NutchConfiguration.create();
> NutchBean bean = new NutchBean(conf);
> Query query = new Query(conf);
> query.setParams(qp);
> Hits hits = bean.search(query);
>
> Didn't get anything.
>
>
> Is there someone who can give me a quick example of how this could be done?
>
>
>
> ________________________________________
> From: CatOs Mandros [[email protected]]
> Sent: Tuesday, August 24, 2010 4:10 AM
> To: [email protected]
> Subject: Re: find segment for an url
>
> Hi Henry,
>
> If I'm not mistaken, the correct way to handle this is to query your
> index. It should have the information about which segment the URL is
> located in. Then you should only have to run your code on the segment
> returned to get the content.
>
>
> On Tue, Aug 24, 2010 at 12:24 AM, Henry Noerdlinger
> <[email protected]> wrote:
>> I want to loop through URLs which have been crawled / indexed.
>>
>> I have a (known) subset of URLs that I want to get the (raw) content for.
>>
>> If I know the segment, I can do something like this:
>> String segName = "20100817162607";
>> String url = "http://adomain.com/awebappOfInterest/someContent.do";
>>
>> HitDetails detail = new HitDetails(segName, url);
>> Configuration conf = NutchConfiguration.create();
>>
>> NutchBean bean = new NutchBean(conf);
>>
>> byte[] contentBytes = bean.getContent(detail);
>> for (byte b : contentBytes)
>> {
>> System.out.print((char)b);
>> }
>>
>> My question is: given a known URL, how can I find what segment it is in? Is
>> there something in the API for giving a URL and getting back the name of
>> the segment it is found in?
>>
>> regards,
>> -henry
>> [email protected]
>>
>> InfoNow Corporation | This communication, including attachments, is for
>> the exclusive use of addressee and may contain proprietary, confidential or
>> privileged information.
>>