Thank you Lewis!
About the second question:
db
{
"batchId": "batch-id"
}
I replaced batch-id with a batchId value taken from the database.
It doesn't work.
Regards,
Vladimir.
-Original Message-
From: lewis john mcgibbney [mailto:lewi...@apache.org]
Sent: November-15-16 11:53 AM
To: user@nutch.apache.org
Subject: Re: Nutch 2.3.1 REST calls to DB
Hi Vladimir,
Responses inline
On Thu, Nov 10, 2016 at 1:05 AM, <user-digest-h...@nutch.apache.org> wrote:
> From: Vladimir Loubenski <vloub...@opentext.com>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Date: Tue, 8 Nov 2016 17:53:59 +
> Subject: Nutch 2.3.1 REST calls to DB
> Hi,
> The Nutch 2.x REST API documentation mentions the following syntax for DB
> calls:
> https://wiki.apache.org/nutch/NutchRESTAPI
>
> 1. What do "startKey", "endKey" and "isKeysReversed" mean?
> POST /db
>{
> "startKey":"com.google",
> "endKey":"com.yahoo",
> "isKeysReversed":"true"
>}
>
Essentially you are running a DB query here, because we are attempting to
obtain data from one of the Gora-supported databases. If you wish to read the
code, please see
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/api/resources/DbResource.java
In this case the startKey and endKey are what make up the 'DbFilter'
object. More on the particular object semantics can be seen at
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/api/model/request/DbFilter.java
Here you are setting a start key and an end key that bound the scan, and a
results Iterator is returned for that range. Please note that right now the
Gora Query API is not consistent about whether start keys and end keys are
inclusive.
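To make the shape of such a query concrete, here is a minimal Python sketch that only builds the POST /db request from the example above. The server address (localhost:8081) and the helper name build_db_request are my assumptions for illustration, not something defined by Nutch; adjust them to your setup.

```python
import json
from urllib.request import Request

# Assumption: the Nutch REST server is reachable here; change as needed.
NUTCH_SERVER = "http://localhost:8081"


def build_db_request(start_key, end_key, keys_reversed=True):
    """Build (but do not send) a POST /db request for a key-range scan."""
    body = {
        "startKey": start_key,
        "endKey": end_key,
        # Passed as a string, mirroring the JSON shown in the question.
        "isKeysReversed": "true" if keys_reversed else "false",
    }
    return Request(
        url=NUTCH_SERVER + "/db",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_db_request("com.google", "com.yahoo")
# To actually send it: urllib.request.urlopen(request).read()
```

Since inclusivity of the bounds is not consistent across Gora backends, it is worth checking the first and last rows returned rather than assuming either behaviour.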
Now on to the 'isKeysReversed' aspect of the JSON configuration. This one
relates to whether or not your keys represent a URL in its reversed form.
In both Nutch 1.X and 2.X, keys within the WebGraph DB are reversed due to the
improvements this offers us in terms of query and scan performance.
Let's take an example
'org.apache.nutch...'
This means that we can scan initially for 'org' then 'apache' meaning that we
are scanning a significantly reduced subset of the data contained within the
WebGraph DB. On the other hand, let's consider the following
'http://nutch.apache.org...'
This would mean that we query by 'http://', then 'nutch'.
The issue with querying for 'http://' is that more or less EVERY key within the
DB would contain 'http://' meaning that our path to query is significantly
increased and our query is not going to be very efficient at all.
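A small Python sketch may help illustrate why reversed keys scan well. I believe Nutch 2.x builds these keys in TableUtil.reverseUrl; the exact layout used below (reversed host, then ':' and the scheme, an optional ':port', then the path) is my assumption of that format, so please check it against the source. The scan helper is purely hypothetical and uses an exclusive end key, which may not match your Gora backend.

```python
import bisect
from urllib.parse import urlsplit


def reverse_url(url):
    """Approximate a reversed-host key, e.g. 'org.apache.nutch:http/'."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    key = ".".join(reversed(host.split("."))) + ":" + parts.scheme
    if parts.port is not None:
        key += ":" + str(parts.port)
    path = parts.path
    if parts.query:
        path += "?" + parts.query
    return key + path


def scan(sorted_keys, start_key, end_key):
    """Return keys in [start_key, end_key) from an already sorted list."""
    lo = bisect.bisect_left(sorted_keys, start_key)
    hi = bisect.bisect_left(sorted_keys, end_key)
    return sorted_keys[lo:hi]


urls = [
    "http://nutch.apache.org/",
    "http://gora.apache.org/",
    "http://www.google.com/",
    "http://www.yahoo.com/",
]
keys = sorted(reverse_url(u) for u in urls)
# Keys for one domain sit next to each other, so a range scan touches few rows.
google_rows = scan(keys, "com.google", "com.yahoo")
```

With plain URLs as keys, all four entries would share the 'http://' prefix and no short key range could isolate a single domain.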
>
> The call below doesn't work for me. It always returns an empty result:
> POST /db
>{
> "batchId": "batch-id"
>}
>
>
Please ensure that you have replaced the right-hand side "batch-id" value with
the value of one of your BatchID identifiers. These are created at the generate
phase of a crawl cycle. In order to obtain a list of all the BatchIDs you've
created, you would need to query your database separately, outside of Nutch,
and build the list from the results.
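As a rough sketch of that last step: once you have pulled rows out of your datastore by whatever means it offers, you could collect the distinct batch IDs and build one POST /db body per batch. Everything here is hypothetical, including the row layout (a dict with a "batchId" entry), the helper name and the placeholder IDs; how the batch ID is actually stored depends on your Gora backend and mapping.

```python
import json


def batch_request_bodies(rows):
    """Return one JSON body for POST /db per distinct batch ID in rows."""
    batch_ids = sorted({row["batchId"] for row in rows if row.get("batchId")})
    return [json.dumps({"batchId": batch_id}) for batch_id in batch_ids]


# Placeholder rows standing in for the result of your own datastore query.
rows = [
    {"key": "org.apache.nutch:http/", "batchId": "batch-A"},
    {"key": "org.apache.gora:http/", "batchId": "batch-B"},
    {"key": "com.google.www:http/", "batchId": "batch-A"},
    {"key": "com.yahoo.www:http/", "batchId": None},  # not yet generated
]
bodies = batch_request_bodies(rows)
```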
hth
Lewis