Thank you Lewis!
Regarding the second question:
POST /db
   {
       "batchId": "batch-id"
    }
I replaced "batch-id" with a batchId value taken from the database.
It doesn't work.
Regards,
Vladimir.

-----Original Message-----
From: lewis john mcgibbney [mailto:[email protected]] 
Sent: November-15-16 11:53 AM
To: [email protected]
Subject: Re: Nutch 2.3.1 REST calls to DB

Hi Vladimir,
Responses inline

On Thu, Nov 10, 2016 at 1:05 AM, <[email protected]> wrote:

> From: Vladimir Loubenski <[email protected]>
> To: "[email protected]" <[email protected]>
> Cc:
> Date: Tue, 8 Nov 2016 17:53:59 +0000
> Subject: Nutch 2.3.1 REST calls to DB
> Hi,
> The Nutch 2.x REST API documentation mentions the following syntax for DB
> calls: https://wiki.apache.org/nutch/NutchRESTAPI
>
> 1. What do "startKey", "endKey" and "isKeysReversed" mean?
> POST /db
>    {
>       "startKey":"com.google",
>       "endKey":"com.yahoo",
>       "isKeysReversed":"true"
>    }
>

Essentially you are running a DB query here, because we are attempting to
obtain data from one of the Gora-supported databases. If you wish to read the
code, please see
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/api/resources/DbResource.java
In this case the startKey and endKey are what make up the 'DbFilter'
object. More on the particular object semantics can be seen at
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/api/model/request/DbFilter.java
You are setting a start key and an end key that bound the scan, and a results
Iterator is returned for that range. Please note that right now the Gora Query
API is not consistent about whether start keys and end keys are inclusive.
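A minimal sketch of how such a key-range request could be assembled from a client, assuming a Nutch REST server at http://localhost:8081 (an assumed address, adjust it to your setup). The function name build_db_request is hypothetical; only the JSON field names come from the documented example. The request is built but not sent.

```python
import json
import urllib.request

# Assumption: address of the Nutch REST server; change it to match your setup.
NUTCH_SERVER = "http://localhost:8081"


def build_db_request(start_key, end_key, keys_reversed=True, server=NUTCH_SERVER):
    """Build (but do not send) a POST /db request for a key-range scan."""
    body = {
        "startKey": start_key,
        "endKey": end_key,
        # The documented example passes this flag as a string, so we do too.
        "isKeysReversed": "true" if keys_reversed else "false",
    }
    return urllib.request.Request(
        url=server + "/db",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_db_request("com.google", "com.yahoo")
# To actually send it: urllib.request.urlopen(request).read()
```

Because inclusivity of the start and end keys is not consistent, it is safer not to rely on either boundary key appearing in the results.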
Now on to the 'isKeysReversed' aspect of the JSON configuration. This one
relates to whether or not your keys represent a URL in its reversed form.
In both Nutch 1.X and 2.X, keys within the WebGraph DB are reversed because of
the improvements this offers us in terms of query and scan performance.
Let's take an example:

'org.apache.nutch...'

This means that we can scan initially for 'org', then 'apache', so we are
scanning a significantly reduced subset of the data contained within the
WebGraph DB. On the other hand, let's consider the following:

'http://nutch.apache.org...'

This would mean that we query by 'http://', then 'nutch'.

The issue with querying for 'http://' is that more or less EVERY key within the
DB would contain 'http://', meaning that the range we have to scan is
significantly larger and our query is not going to be very efficient at all.
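To illustrate the point, here is a simplified sketch of why reversed keys make prefix scans cheap. It reverses only the host part of each URL and keeps the keys in a sorted list; this is an illustration and not Nutch's own key format, which may carry more than the host. The helper names reverse_host and prefix_scan and the sample URLs are made up for the example.

```python
from bisect import bisect_left
from urllib.parse import urlsplit


def reverse_host(url):
    """Return the host of `url` with its dot-separated labels reversed."""
    host = urlsplit(url).hostname or ""
    return ".".join(reversed(host.split(".")))


def prefix_scan(sorted_keys, prefix):
    """Return the keys starting with `prefix` from an already sorted list."""
    # Binary search jumps straight to the first candidate key.
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order: no later key can match
        matches.append(key)
    return matches


urls = [
    "http://nutch.apache.org",
    "http://gora.apache.org",
    "http://www.example.com",
]
keys = sorted(reverse_host(u) for u in urls)
apache_keys = prefix_scan(keys, "org.apache.")
```

With reversed keys, everything under 'org.apache.' sits in one contiguous range, so the scan touches only those keys. With unreversed keys, the shared 'http://' prefix narrows nothing.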



>
> The call below doesn't work for me. It always returns an empty result:
> POST /db
>    {
>       "batchId": "batch-id"
>    }
>
>
Please ensure that you have replaced the right-hand side "batch-id" value with
the value of one of your BatchID identifiers. These are created at the generate
phase of a crawl cycle. In order to obtain a list of all the BatchIDs you've
created, you would need to query your database separately, outside of Nutch,
and build a list of BatchID results.
hth
Lewis
