Hi,

I have done a POC for indexing content into Elasticsearch using the Nutch
1.12 REST API; the request details are below. I ran
bin/nutch startserver --port 9090 in local mode. By default, Nutch creates
a folder under the bin directory for each crawl request, named after the
crawlId parameter.
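Each of the requests below is a plain JSON POST to the running nutchserver. As a minimal sketch (assuming the server is up on the port passed to startserver; the helper names here are my own, not part of Nutch), the calls can be made from Python's standard library like this:

```python
import json
import urllib.request

# Assumed base URL: the port given to `bin/nutch startserver --port 9090`.
NUTCH_API = "http://localhost:9090"

def build_request(path, payload):
    """Build a JSON POST request for the Nutch REST server."""
    return urllib.request.Request(
        NUTCH_API + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def post_json(path, payload):
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(build_request(path, payload)) as resp:
        return resp.read().decode("utf-8")

# Example (needs the server running), with a body like the /config/create call below:
# post_json("/config/create", {"configId": "ereader", "force": "true", "params": {...}})
```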

POST /config/create
{
    "configId": "ereader",
    "force": "true",
    "params": {
        "http.agent.name": "elasticnutchrest",
        "http.robots.agents": "elasticnutchrest",
        "http.timeout": "1000000",
        "plugin.includes": "protocol-httpclient|urlfilter-regex|parse-(text|tika|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-elastic",
        "index.metadata": "title,content",
        "index.parse.md": "metatag.title,metatag.content",
        "elastic.host": "localhost"
    }
}

POST /seed/create
{
    "id": "101",
    "name": "ereader",
    "seedUrls": [
        {
            "id": "1",
            "url": "https://www.example.com"
        }
    ]
}
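If I remember the behavior correctly, /seed/create writes the seed URLs into a temporary directory and returns that directory's path in the response; that is where the /tmp/... value used as url_dir in the INJECT job comes from. A small sketch of wiring the two together (inject_payload is just an illustrative helper of mine, and the actual POST calls are left commented since they need a live server):

```python
def inject_payload(seed_dir, conf_id="default", crawl_id="crawl01"):
    """Build the /job/create body for an INJECT job from the directory
    path that /seed/create returned."""
    return {
        "type": "INJECT",
        "confId": conf_id,
        "crawlId": crawl_id,
        "args": {"url_dir": seed_dir},
    }

# seed_dir = post_json("/seed/create", seed_payload)  # e.g. "/tmp/1475832312548-0"
# post_json("/job/create", inject_payload(seed_dir))
```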

POST /job/create  -- Inject
{
    "type": "INJECT",
    "confId": "default",
    "crawlId": "crawl01",
    "args": {
        "url_dir": "/tmp/1475832312548-0"
    }
}




---GENERATE
{
    "type": "GENERATE",
    "confId": "default",
    "crawlId": "crawl01",
    "args": {
        "segments_dir": "/bin/crawl01/segments"
    }
}


---Fetch
{
    "type": "FETCH",
    "confId": "default",
    "crawlId": "crawl01",
    "args": {
        "segment_dir": "/bin/crawl01/segments/20161007152133",   // input path
        "threads": "50"
    }
}



---PARSE
{
    "type": "PARSE",
    "confId": "default",
    "crawlId": "crawl01",
    "args": {
        "segment_dir": "/bin/crawl01/segments/20161007152133",   // input path
        "threads": "50"
    }
}


---UpdateDB
{
    "type": "UPDATEDB",
    "confId": "default",
    "crawlId": "crawl01",
    "args": {
        "segment_dir": "/home/osboxes/trunk/runtime/local/bin/crawl01/segments/20161007152133"   // full input path
    }
}


---Index
{
    "type": "INDEX",
    "confId": "default",
    "crawlId": "crawl01",
    "args": {
        "segment_dir": "/home/osboxes/trunk/runtime/local/bin/crawl01/segments/20161007152133"   // full input path
    }
}
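The whole generate-fetch-parse-updatedb-index round can be scripted by POSTing the jobs in order. A rough, self-contained driver under a few assumptions of my own: the server is on localhost:9090, and the timestamped segment directory (which GENERATE creates) is discovered between GENERATE and FETCH; this sketch takes it as a parameter for brevity. Also note that /job/create returns immediately, so a real driver would poll each job's status via the REST API before moving on; that polling is omitted here.

```python
import json
import urllib.request

NUTCH_API = "http://localhost:9090"  # assumed: host/port of the running nutchserver

def job_payload(job_type, args, conf_id="default", crawl_id="crawl01"):
    """Build a /job/create body matching the request shapes above."""
    return {"type": job_type, "confId": conf_id, "crawlId": crawl_id, "args": args}

def post_json(path, payload):
    """POST a JSON body and return the raw response text."""
    req = urllib.request.Request(
        NUTCH_API + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

def crawl_round(segments_dir, segment):
    """One generate->fetch->parse->updatedb->index round. In practice you
    would run GENERATE first, wait for it, then list segments_dir to find
    the new timestamped segment; here it is taken as a parameter."""
    post_json("/job/create", job_payload("GENERATE", {"segments_dir": segments_dir}))
    for job_type in ("FETCH", "PARSE", "UPDATEDB", "INDEX"):
        args = {"segment_dir": segment}
        if job_type in ("FETCH", "PARSE"):
            args["threads"] = "50"
        post_json("/job/create", job_payload(job_type, args))
```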



Hope this helps. Let me know if you need any clarification.


On Tue, Oct 11, 2016 at 8:08 PM, Sujen Shah <[email protected]> wrote:

> Hi
> You could find the rest api documentation for Nutch 1.x here
> https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI and for
> Nutch 2.X here
> https://wiki.apache.org/nutch/NutchRESTAPI
>
> I am in the process of reviewing and updating it wherever anything is
> inconsistent; there have been changes in the Nutch 1.x REST service since
> it is under active development.
>
> It'd be great if you could give it a try and report any issues.
>
> Thank you!
>
> Regards,
> Sujen Shah
>
> On Oct 11, 2016 7:09 AM, "WebDawg" <[email protected]> wrote:
>
> > I would please very much like this.
> >
> > I was thinking about talking to the devs eventually; the documentation
> > seems non-existent.
> >
> > I suppose it's a matter of reading the source and working with what's there?
> >
> > On Mon, Oct 10, 2016 at 11:33 PM, MrSrivastavaRK .
> > <[email protected]> wrote:
> > > Hi,
> > > I have successfully indexed content into Elasticsearch using the Nutch
> > > 1.12 REST API. I can send you the API details if you want them for reference.
> > >
> > > Regards
> > > Rajeev
> > >
> > > On Oct 10, 2016 11:31 PM, "WebDawg" <[email protected]> wrote:
> > >
> > >> Hello,
> > >>
> > >> I have the webapp and nutchserver running successfully, and I would like
> > >> to know more about the API and whether it is functional.
> > >>
> > >> I am trying to hack on it and wonder what the relationship is between
> > >> the different config URLs and configs.
> > >>
> > >> Any help on this?  I would like to figure out how this works.  Does it
> > >> reference all the files in the conf dir?  If I do a crawl, is it the
> > >> same as executing a crawl via the command line?
> > >>
> >
>



-- 
Regards
Rajeev K. Srivastava
