RE: find segment for an url

Henry Noerdlinger Tue, 24 Aug 2010 14:32:29 -0700

Thank you,

That is what I had already begun to look into. At some point in my process, a 
configuration will need to be set that defines a search identifying a 
specific set of URLs to get content from.
I was planning on creating some code to sift through readlinkdb results.


My solution is now like this:

      //this is a pattern which identifies a set of urls
      String term = "partnerdetails";
      String field = "url";
      Configuration conf = NutchConfiguration.create();
      NutchBean bean = new NutchBean(conf);
      Query query = new Query(conf);
      query.addRequiredTerm(term,field);
      Hits hits = bean.search(query);
      for (int i = 0; i < hits.getLength(); i++)
      {
         Hit hit = hits.getHit(i);
         HitDetails detail = bean.getDetails(hit);
         byte[] contentBytes = bean.getContent(detail);
         // Decode the whole array at once; casting each byte to char
         // mangles non-ASCII bytes. Platform default charset assumed here.
         String content = new String(contentBytes);
      }
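
A side note on turning the byte[] from getContent into text: casting each byte to char only works for ASCII, since Java bytes are signed and any byte above 0x7F turns into an unrelated character. A minimal standalone sketch, assuming the page happens to be UTF-8 (the class and method names are hypothetical and not part of Nutch; the real charset would have to come from the page's headers or metadata):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of the Nutch API.
public class ContentDecoding {

    // Decodes the whole array using a charset name supplied by the caller.
    public static String decode(byte[] contentBytes, String charsetName) {
        return new String(contentBytes, Charset.forName(charsetName));
    }

    // The per-byte cast, kept only for comparison: bytes are signed, so
    // values above 0x7F sign-extend into unrelated characters.
    public static String castEachByte(byte[] contentBytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : contentBytes) {
            sb.append((char) b);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "caf\u00e9" contains one non-ASCII character (two bytes in UTF-8).
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        // Compare each approach against the original text.
        System.out.println(decode(raw, "UTF-8").equals("caf\u00e9"));
        System.out.println(castEachByte(raw).equals("caf\u00e9"));
    }
}
```

For pure-ASCII pages the two approaches give the same string, which is probably why the cast appears to work in quick tests.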

________________________________________
From: CatOs Mandros [[email protected]]
Sent: Tuesday, August 24, 2010 1:52 PM
To: [email protected]
Subject: Re: find segment for an url

If I were you I would use Luke ( http://code.google.com/p/luke/ ) to
examine what data you have in your indexes, if you're using Lucene
indexes :)

On Tue, Aug 24, 2010 at 6:21 PM, Henry Noerdlinger
<[email protected]> wrote:
> Thank you for the response.
>
> I ran a simple test where I constructed a QueryParams object with a field / 
> value pair of "url" and "http://blahblah.com/",
> then added this to a Query object and passed it to my beloved NutchBean 
> to search, like this:
>      String urlVal = "http://domain.com/webapp/content.do";
>      QueryParams qp = new QueryParams();
>      qp.put("url", urlVal);
>      Configuration conf = NutchConfiguration.create();
>      NutchBean bean = new NutchBean(conf);
>      Query query = new Query(conf);
>      query.setParams(qp);
>      Hits hits = bean.search(query);
>
> Didn't get anything.
>
>
> Is there someone who can give me a quick example of how this could be done?
>
>
>
> ________________________________________
> From: CatOs Mandros [[email protected]]
> Sent: Tuesday, August 24, 2010 4:10 AM
> To: [email protected]
> Subject: Re: find segment for an url
>
> Hi Henry,
>
> If I'm not mistaken, the correct way to handle this is to query your
> index. It should have the information about which segment the URL is
> located in. Then you should only have to run your code on the segment
> returned to get the content.
>
>
> On Tue, Aug 24, 2010 at 12:24 AM, Henry Noerdlinger
> <[email protected]> wrote:
>> I want to loop through URLs which have been crawled / indexed.
>>
>> I have a (known) subset of URLs that I want to get the (raw) content for.
>>
>> If I know the segment, I can do something like this:
>>      String segName = "20100817162607";
>>      String url = "http://adomain.com/awebappOfInterest/someContent.do";
>>
>>      HitDetails detail = new HitDetails(segName, url);
>>      Configuration conf = NutchConfiguration.create();
>>
>>      NutchBean bean = new NutchBean(conf);
>>
>>      byte[] contentBytes = bean.getContent(detail);
>>      for (byte b : contentBytes)
>>      {
>>         System.out.print((char)b);
>>      }
>>
>> My question is: given a known URL, how can I find which segment it is in? Is 
>> there something in the API for passing in a URL and getting back the name of 
>> the segment it is found in?
>>
>> regards,
>> -henry
>> [email protected]
>>
>> InfoNow Corporation  |  This communication, including attachments, is for 
>> the exclusive use of addressee and may contain proprietary, confidential or 
>> privileged information.
>>
>
>


