RE: Nutch 2.3.1 REST calls to DB

2016-11-15 Thread Vladimir Loubenski
Thank you Lewis!
Regarding the second question:
POST /db
   {
   "batchId": "batch-id"
   }
I replaced "batch-id" with a batchId value taken from the database.
It still doesn't work.
Regards,
Vladimir.

-----Original Message-----
From: lewis john mcgibbney [mailto:lewi...@apache.org] 
Sent: November-15-16 11:53 AM
To: user@nutch.apache.org
Subject: Re: Nutch 2.3.1 REST calls to DB

Hi Vladimir,
Responses inline

On Thu, Nov 10, 2016 at 1:05 AM, <user-digest-h...@nutch.apache.org> wrote:

> From: Vladimir Loubenski <vloub...@opentext.com>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Date: Tue, 8 Nov 2016 17:53:59 +0000
> Subject: Nutch 2.3.1 REST calls to DB
> Hi,
> The Nutch 2.x REST API documentation gives the following syntax for DB
> calls: https://wiki.apache.org/nutch/NutchRESTAPI
>
> 1. What do "startKey", "endKey" and "isKeysReversed" mean?
> POST /db
>{
>   "startKey":"com.google",
>   "endKey":"com.yahoo",
>   "isKeysReversed":"true"
>}
>

Essentially you are running a DB query here, because we are attempting to
obtain data from one of the Gora-supported databases. If you wish to read
the code, please see
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/api/resources/DbResource.java
In this case the startKey and endKey are what make up the 'DbFilter'
object. More on the particular object semantics can be seen at
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/api/model/request/DbFilter.java
Here you are setting a start key and an end key that bound the scan, and a
results Iterator is returned for that range. Please note that right now the
Gora Query API is not consistent about whether start keys and end keys are
inclusive.
Now on to the 'isKeysReversed' aspect of the JSON configuration. This one
relates to whether or not your keys represent a URL in its reversed form.
In both Nutch 1.X and 2.X, keys within the WebGraph DB are reversed due to
the improvements this offers us in terms of query and scan performance.
Let's take an example:

'org.apache.nutch...'

This means that we can scan initially for 'org' then 'apache', meaning that
we are scanning a significantly reduced subset of the data contained within
the WebGraph DB. On the other hand, let's consider the following:

'http://nutch.apache.org...'

This would mean that we query by 'http://', then 'nutch'.

The issue with querying for 'http://' is that more or less EVERY key within
the DB would contain 'http://', meaning that the range we have to scan is
significantly larger and our query is not going to be very efficient at
all.



>
> The call below doesn't work for me. It always returns an empty result.
> POST /db
>{
>   "batchId": "batch-id"
>}
>
>
Please ensure that you have replaced the right-hand side "batch-id" value
with the value of one of your BatchID identifiers. These are created at the
generate phase of a crawl cycle. In order to obtain a list of all the
BatchIDs you've created, you would need to query your database separately,
outside of Nutch, and create a list of BatchID results.
hth
Lewis


Re: Nutch 2.3.1 REST calls to DB

2016-11-15 Thread lewis john mcgibbney
Hi Vladimir,
Responses inline

On Thu, Nov 10, 2016 at 1:05 AM, <user-digest-h...@nutch.apache.org> wrote:

> From: Vladimir Loubenski <vloub...@opentext.com>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Date: Tue, 8 Nov 2016 17:53:59 +0000
> Subject: Nutch 2.3.1 REST calls to DB
> Hi,
> The Nutch 2.x REST API documentation gives the following syntax for DB
> calls: https://wiki.apache.org/nutch/NutchRESTAPI
>
> 1. What do "startKey", "endKey" and "isKeysReversed" mean?
> POST /db
>{
>   "startKey":"com.google",
>   "endKey":"com.yahoo",
>   "isKeysReversed":"true"
>}
>

Essentially you are running a DB query here, because we are attempting to
obtain data from one of the Gora-supported databases. If you wish to read
the code, please see
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/api/resources/DbResource.java
In this case the startKey and endKey are what make up the 'DbFilter'
object. More on the particular object semantics can be seen at
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/api/model/request/DbFilter.java
Here you are setting a start key and an end key that bound the scan, and a
results Iterator is returned for that range. Please note that right now the
Gora Query API is not consistent about whether start keys and end keys are
inclusive.
Now on to the 'isKeysReversed' aspect of the JSON configuration. This one
relates to whether or not your keys represent a URL in its reversed form.
In both Nutch 1.X and 2.X, keys within the WebGraph DB are reversed due to
the improvements this offers us in terms of query and scan performance.
Let's take an example:

'org.apache.nutch...'

This means that we can scan initially for 'org' then 'apache', meaning that
we are scanning a significantly reduced subset of the data contained within
the WebGraph DB. On the other hand, let's consider the following:

'http://nutch.apache.org...'

This would mean that we query by 'http://', then 'nutch'.

The issue with querying for 'http://' is that more or less EVERY key within
the DB would contain 'http://', meaning that the range we have to scan is
significantly larger and our query is not going to be very efficient at
all.
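
To make the point about reversed keys and start/end keys concrete, here is
a minimal Python sketch. It is not Nutch or Gora code: the helper names
(reverse_host, range_scan) and the sample URLs are made up for illustration,
only the host part of the URL is reversed, and both ends of the range are
treated as inclusive, which, as noted above, a given Gora backend may or may
not do.

```python
from bisect import bisect_left, bisect_right
from urllib.parse import urlsplit


def reverse_host(url: str) -> str:
    """Return the host of `url` with its dot-separated labels reversed."""
    host = urlsplit(url).hostname or ""
    return ".".join(reversed(host.split(".")))


def range_scan(sorted_keys, start_key, end_key):
    """Return the keys k with start_key <= k <= end_key.

    Both ends are inclusive in this sketch; that is an assumption made
    for the illustration, not a statement about any Gora backend.
    """
    lo = bisect_left(sorted_keys, start_key)
    hi = bisect_right(sorted_keys, end_key)
    return sorted_keys[lo:hi]


# Made-up sample data for the illustration.
urls = [
    "http://nutch.apache.org",
    "http://gora.apache.org",
    "http://www.google.com",
    "http://www.yahoo.com",
    "http://example.net",
]

# Keys as a sorted store would hold them, in reversed and in plain form.
reversed_keys = sorted(reverse_host(u) for u in urls)
plain_keys = sorted(urls)

# The range from the question: only keys sorting between the two bounds match.
google_to_yahoo = range_scan(reversed_keys, "com.google", "com.yahoo")

# Everything under apache.org sits in one contiguous slice of the key space.
apache_keys = range_scan(reversed_keys, "org.apache", "org.apache.\uffff")

# With plain URLs the common 'http://' prefix narrows nothing down.
plain_prefix_matches = [k for k in plain_keys if k.startswith("http://")]

print(google_to_yahoo)
print(apache_keys)
print(len(plain_prefix_matches), "of", len(plain_keys))
```

Note that in this sketch 'com.yahoo.www' falls outside the first range,
because it sorts after the end key 'com.yahoo'. That is the kind of boundary
behaviour worth checking against your own backend.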



>
> The call below doesn't work for me. It always returns an empty result.
> POST /db
>{
>   "batchId": "batch-id"
>}
>
>
Please ensure that you have replaced the right-hand side "batch-id" value
with the value of one of your BatchID identifiers. These are created at the
generate phase of a crawl cycle. In order to obtain a list of all the
BatchIDs you've created, you would need to query your database separately,
outside of Nutch, and create a list of BatchID results.
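
As a rough sketch of what the request could look like once a real BatchID
is substituted, the Python below only builds the POST /db request and prints
it, without sending anything. The server address (localhost:8081) and the
BatchID value are assumptions for illustration, so adjust both to your own
setup. build_db_request is a hypothetical helper, not part of Nutch.

```python
import json
import urllib.request

# Assumption: address of a running Nutch REST server; change to match yours.
NUTCH_SERVER = "http://localhost:8081"


def build_db_request(batch_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST /db request filtering on one BatchID."""
    body = json.dumps({"batchId": batch_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{NUTCH_SERVER}/db",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "1479225600-12345" is a made-up placeholder, not a real BatchID.
request = build_db_request("1479225600-12345")

# Inspect exactly what would go over the wire.
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen() would actually send it.
Printing it first makes it easy to confirm that the literal "batch-id"
placeholder is no longer in the body.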
hth
Lewis


Nutch 2.3.1 REST calls to DB

2016-11-08 Thread Vladimir Loubenski
Hi,
The Nutch 2.x REST API documentation gives the following syntax for DB calls:
https://wiki.apache.org/nutch/NutchRESTAPI

1. What do "startKey", "endKey" and "isKeysReversed" mean?
POST /db
   {
  "startKey":"com.google",
  "endKey":"com.yahoo",
  "isKeysReversed":"true"
   }

2. The call below doesn't work for me. It always returns an empty result.
POST /db
   {
  "batchId": "batch-id"
   }


Thank you in advance.
Vladimir.