Re: Custom elastic indexer in nutch

2016-11-07 Thread Sachin Shaju
A more detailed answer to the same question:
http://stackoverflow.com/questions/40418712/adding-custom-fields-and-types-in-nutch-elastic-indexer/40423485#40423485

Regards,
Sachin Shaju

sachi...@mstack.com

On Fri, Nov 4, 2016 at 2:35 PM, Sachin Shaju <sachi...@mstack.com> wrote:

> Hi,
>
> I was doing test runs with the Nutch elastic indexer. I would like to add
> some custom fields and custom type names (instead of "doc") that can be given
> as arguments to the indexing job. I understand that *NutchDocument* is the
> class responsible for setting field names and metadata, but I couldn't
> figure out where Nutch creates an instance of it and sets the values. Or
> is there another way to do this? Please help.
>
> Regards,
> Sachin Shaju
>
> sachi...@mstack.com
>

-- 
 

The information contained in this electronic message and any attachments to 
this message are intended for the exclusive use of the addressee(s) and may 
contain proprietary, confidential or privileged information. If you are not 
the intended recipient, you should not disseminate, distribute or copy this 
e-mail. Please notify the sender immediately and destroy all copies of this 
message and any attachments.

WARNING: Computer viruses can be transmitted via email. The recipient 
should check this email and any attachments for the presence of viruses. 
The company accepts no liability for any damage caused by any virus 
transmitted by this email.

www.mStack.com


Re: Custom elastic indexer in nutch

2016-11-06 Thread Sachin Shaju
How can I do the same with index.parse.md? Any useful links or a
demonstration would be appreciated.
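
For readers following this thread, here is a minimal sketch of what the
index.parse.md route might look like. As far as I know, index.parse.md takes a
comma-separated list of parse-metadata keys and is read by the index-metadata
plugin, which would need to be listed in plugin.includes. The helper name
property_xml and the two metadata keys are hypothetical, chosen only for
illustration; the output is meant to be pasted inside the <configuration>
element of nutch-site.xml.

```python
import xml.etree.ElementTree as ET


def property_xml(name, value):
    """Render one Hadoop-style <property> element as a string."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


# Assumption: these metadata keys are examples only; use the keys that your
# parser actually writes into the parse metadata.
parse_md = property_xml("index.parse.md", "metatag.description,metatag.keywords")
print(parse_md)
```

Each key listed in the value should then show up as a field of the same name
in the indexed document, provided the parser actually emitted that key.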

Regards,
Sachin Shaju

sachi...@mstack.com
+919539887554

On Sat, Nov 5, 2016 at 8:49 PM, MrSrivastavaRK . <srivastav...@gmail.com>
wrote:

> I am facing the same problem. I thought I'd share a workaround: you can
> add the values to the configuration in the crawl request and retrieve them
> when the indexer job starts.
>
> On Nov 5, 2016 5:00 PM, "Markus Jelsma" <markus.jel...@openindex.io>
> wrote:
>
> > Hi - If you want to index some custom fields, you can either use
> > index.parse.md or create a custom indexing filter plugin.
> > Markus
> >
> > -Original message-
> > > From:Sachin Shaju <sachi...@mstack.com>
> > > Sent: Friday 4th November 2016 10:05
> > > To: user@nutch.apache.org
> > > Subject: Custom elastic indexer in nutch
> > >
> > > > Hi,
> > > >
> > > > I was doing test runs with the Nutch elastic indexer. I would like to
> > > > add some custom fields and custom type names (instead of "doc") that
> > > > can be given as arguments to the indexing job. I understand that
> > > > *NutchDocument* is the class responsible for setting field names and
> > > > metadata, but I couldn't figure out where Nutch creates an instance of
> > > > it and sets the values. Or is there another way to do this? Please
> > > > help.
> > >
> > > Regards,
> > > Sachin Shaju
> > >
> > > sachi...@mstack.com
> > >
> > > --
> > >
> > >
> > > The information contained in this electronic message and any
> attachments
> > to
> > > this message are intended for the exclusive use of the addressee(s) and
> > may
> > > contain proprietary, confidential or privileged information. If you are
> > not
> > > the intended recipient, you should not disseminate, distribute or copy
> > this
> > > e-mail. Please notify the sender immediately and destroy all copies of
> > this
> > > message and any attachments.
> > >
> > > WARNING: Computer viruses can be transmitted via email. The recipient
> > > should check this email and any attachments for the presence of
> viruses.
> > > The company accepts no liability for any damage caused by any virus
> > > transmitted by this email.
> > >
> > > www.mStack.com
> > >
> >
>


Re: Nutch as a service

2016-10-07 Thread Sachin Shaju
Hi Furkan,
I've tried passing null for args. It didn't work either.
After investigating the source code of *Fetcher.java*, I figured out that it
looks for the segment in a local path if no segment option is given. If the
segment option is set to a valid segment in HDFS, it works. I've resolved the
issue by returning the segment path from the generate phase in the results
JSON of the generate REST call. I added one or two lines to the source code
of *Generator.java* and it works. I'm not sure whether this is the right way
to do it, but it works. Please write to me if there is a better option.
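
As a rough sketch of the workaround described above, the fetch request can
carry the segment explicitly in args. This assumes the Fetcher reads the path
from an args key named "segment"; the helper name fetch_job_payload and the
segment path are hypothetical and only illustrate the shape of the request
body for POST /job/create.

```python
import json


def fetch_job_payload(conf_id, crawl_id, segment=None):
    """Build the JSON body for POST /job/create with type FETCH."""
    args = {}
    if segment is not None:
        # Assumption: the Fetcher looks up the segment path under this key.
        args["segment"] = segment
    return {"type": "FETCH", "confId": conf_id, "crawlId": crawl_id, "args": args}


# Hypothetical HDFS segment path, for illustration only.
payload = fetch_job_payload("news", "crawl001", "crawl001/segments/20161005185359")
print(json.dumps(payload, indent=2))
```

Without the segment argument the payload is the same as the request that
failed with the NullPointerException, so the only difference is the extra key.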

Everything works until the index phase. Indexing to Elasticsearch fails with
an unknown exception. Please have a look at
http://www.mail-archive.com/user%40nutch.apache.org/msg15001.html

Regards,
Sachin Shaju

sachi...@mstack.com

On Thu, Oct 6, 2016 at 10:12 PM, Furkan KAMACI <furkankam...@gmail.com>
wrote:

> Hi Sachin,
>
> Could you check it again with sending *null* instead of *{}* ?
>
> Kind Regards,
> Furkan KAMACI
>
> On Thu, Oct 6, 2016 at 7:20 AM, Sachin Shaju <sachi...@mstack.com> wrote:
>
> > Hi Sujen,
> > Thanks for the reply. Actually, that Stack Overflow post was created by
> > me. :) I have some more questions.
> >  1. Do I have to run the server on the Hadoop namenode itself?
> >  2. I have tested the Nutch server on Hadoop, but in the *fetch phase* it
> > hits a *NullPointerException*, which I'm posting here:
> > 16/10/05 18:53:59 ERROR impl.JobWorker: Cannot run job worker!
> >
> > java.lang.NullPointerException
> > at java.util.Arrays.sort(Arrays.java:1438)
> > at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:564)
> > at org.apache.nutch.service.impl.JobWorker.run(JobWorker.java:71)
> > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> > at java.lang.Thread.run(Thread.java:745)
> >
> > I've checked the source code. It is caused by the absence of a segment
> > parameter in the REST call for fetch. I was expecting it to pick the
> > latest segment automatically, but it does not work that way.
> >
> > The request I used is:
> >
> > POST /job/create
> > {
> >     "type": "FETCH",
> >     "confId": "news",
> >     "crawlId": "crawl001",
> >     "args": {}
> > }
> >
> > Am I missing anything here ?
> >
> >
> >
> >
> > Regards,
> > Sachin Shaju
> >
> > sachi...@mstack.com
> > +919539887554
> >
> > On Thu, Oct 6, 2016 at 5:03 AM, Sujen Shah <sujen1...@gmail.com> wrote:
> >
> > > Hi Sachin,
> > >
> > > Nutch REST API is built using Apache CXF framework and JAX-RS. The
> > > Nutch Server uses an embedded Jetty Server to service the http
> > > requests. You can find out more about CXF and Jetty here (
> > > http://cxf.apache.org/docs/overview.html).
> > >
> > > The server runs on one machine waiting for http requests. Once a
> > > request is received it will start the respective Nutch Job requested
> > > (which might be distributed ex- fetch job)
> > >
> > >
> > > Just for visibility on the user list, this question was asked on
> > > stackoverflow. Link to the question and follow up discussion can be
> > > found at -
> > > http://stackoverflow.com/questions/39853492/working-of-nutch-server-in-distributed-mode
> > >
> > > Thanks
> > > Sujen
> > >
> > >
> > >
> > > Regards,
> > > Sujen Shah
> > > M.S - Computer Science
> > > University of Southern California
> > > http://www.linkedin.com/in/sujenshah
> > >
> > > On Tue, Oct 4, 2016 at 6:18 AM, Sachin Shaju <sachi...@mstack.com>
> > > wrote:
> > >
> > > > Hi,
> > > > I would like to know how nutch server works actually? Whether it
> > > > use a listener for incoming crawl requests or it is a continuously
> > > > running server?
> > > > Regards,
> > > > Sachin Shaju
> > > >
> > > > sachi...@mstack.com
> > > >

Unknown issue in Nutch indexer with REST api

2016-10-07 Thread Sachin Shaju
Hi,
I was trying to expose Nutch through its REST endpoints and ran into an issue
in the indexer phase. I'm using the Elasticsearch index writer to index docs
to ES. I started the server with the
$NUTCH_HOME/runtime/deploy/bin/nutch startserver command. While indexing, an
unknown exception is thrown.

Error: com.google.common.util.concurrent.MoreExecutors.directExecutor()Ljava/util/concurrent/Executor;
16/10/07 16:01:47 INFO mapreduce.Job:  map 100% reduce 0%
16/10/07 16:01:49 INFO mapreduce.Job: Task Id : attempt_1475748314769_0107_r_00_1, Status : FAILED
Error: com.google.common.util.concurrent.MoreExecutors.directExecutor()Ljava/util/concurrent/Executor;
16/10/07 16:01:53 INFO mapreduce.Job: Task Id : attempt_1475748314769_0107_r_00_2, Status : FAILED
Error: com.google.common.util.concurrent.MoreExecutors.directExecutor()Ljava/util/concurrent/Executor;
16/10/07 16:01:58 INFO mapreduce.Job:  map 100% reduce 100%
16/10/07 16:01:59 INFO mapreduce.Job: Job job_1475748314769_0107 failed with state FAILED due to: Task failed task_1475748314769_0107_r_00
Job failed as tasks failed. failedMaps:0 failedReduces:1

ERROR indexer.IndexingJob: Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)

Failed with exit code 255.

Any help would be appreciated.

PS: After debugging with the stack trace, I think the issue is due to a
mismatch in the Guava version. I tried changing the build.xml of the plugins
(parse-tika and parsefilter-naivebayes), but it didn't work.
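
One way to test the Guava theory is sketched below. To my knowledge
MoreExecutors.directExecutor() first appeared in Guava 18, so any older Guava
jar that ends up on the job classpath could produce the error above. The
function name old_guava_jars and the jar listing are hypothetical; in practice
the list would come from the lib directories actually shipped with the job.

```python
import re

# Captures the major version from names such as guava-11.0.2.jar.
GUAVA_JAR = re.compile(r"guava-(\d+).*\.jar$")


def old_guava_jars(jar_names, minimum_major=18):
    """Return the guava jars whose major version is below minimum_major."""
    old = []
    for name in jar_names:
        match = GUAVA_JAR.search(name)
        if match and int(match.group(1)) < minimum_major:
            old.append(name)
    return old


# Hypothetical classpath listing, for illustration only.
jars = ["guava-11.0.2.jar", "guava-18.0.jar", "elasticsearch-2.3.3.jar"]
print(old_guava_jars(jars))
```

If an older jar turns up next to a newer one, which of the two is loaded
first on the task classpath decides whether the method is found.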


Regards,
Sachin Shaju

sachi...@mstack.com


Re: Nutch as a service

2016-10-05 Thread Sachin Shaju
Hi Sujen,
Thanks for the reply. Actually, that Stack Overflow post was created by me. :)
I have some more questions.
 1. Do I have to run the server on the Hadoop namenode itself?
 2. I have tested the Nutch server on Hadoop, but in the *fetch phase* it
hits a *NullPointerException*, which I'm posting here:
16/10/05 18:53:59 ERROR impl.JobWorker: Cannot run job worker!

java.lang.NullPointerException
at java.util.Arrays.sort(Arrays.java:1438)
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:564)
at org.apache.nutch.service.impl.JobWorker.run(JobWorker.java:71)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

I've checked the source code. It is caused by the absence of a segment
parameter in the REST call for fetch. I was expecting it to pick the latest
segment automatically, but it does not work that way.

The request I used is:

POST /job/create
{
    "type": "FETCH",
    "confId": "news",
    "crawlId": "crawl001",
    "args": {}
}

Am I missing anything here?




Regards,
Sachin Shaju

sachi...@mstack.com
+919539887554

On Thu, Oct 6, 2016 at 5:03 AM, Sujen Shah <sujen1...@gmail.com> wrote:

> Hi Sachin,
>
> Nutch REST API is built using Apache CXF framework and JAX-RS. The Nutch
> Server uses an embedded Jetty Server to service the http requests.
> You can find out more about CXF and Jetty here (
> http://cxf.apache.org/docs/overview.html).
>
> The server runs on one machine waiting for http requests. Once a request is
> received it will start the respective Nutch Job requested (which might be
> distributed ex- fetch job)
>
>
> Just for visibility on the user list, this question was asked on
> stackoverflow. Link to the question and follow up discussion can be found
> at -
> http://stackoverflow.com/questions/39853492/working-of-nutch-server-in-distributed-mode
>
> Thanks
> Sujen
>
>
>
> Regards,
> Sujen Shah
> M.S - Computer Science
> University of Southern California
> http://www.linkedin.com/in/sujenshah
>
> On Tue, Oct 4, 2016 at 6:18 AM, Sachin Shaju <sachi...@mstack.com> wrote:
>
> > Hi,
> > I would like to know how nutch server works actually? Whether it use a
> > listener for incoming crawl requests or it is a continuously running
> > server?
> > Regards,
> > Sachin Shaju
> >
> > sachi...@mstack.com
> >
>


Re: 90% of URL rejected by filtering (Nutch 2.3.1)

2016-10-05 Thread Sachin Shaju
For the time being, you can comment out the line -^.{513,}$ and check again.
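
To reason about which rule rejects a given URL, the sketch below imitates the
filter in a few lines. It assumes the regex URL filter works on a
first-match-wins basis, with "+" accepting, "-" rejecting, and a URL that
matches no rule being rejected; that is my understanding of the plugin, not
something confirmed in this thread. The function names and test URLs are
hypothetical.

```python
import re


def load_rules(text):
    """Parse regex-urlfilter style lines into (accept, pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # "+" means accept; this sketch treats any other prefix as reject.
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """First matching rule wins; a URL that matches no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


RULES = load_rules(r"""
-^(file|ftp|mailto):
-^(http://up.anv.bz)
+.
-^.{513,}$
""")

print(accepts(RULES, "http://example.com/page"))
print(accepts(RULES, "http://up.anv.bz/x"))
print(accepts(RULES, "http://example.com/" + "a" * 600))
```

Under that assumption, any rule placed after +. is never reached, so the
position of the length rule in the file matters as much as whether it is
commented out. Running the seed list through such a check can help narrow
down which rule, if any, is responsible for the rejections.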

Regards,
Sachin Shaju

sachi...@mstack.com
+919539887554

On Wed, Oct 5, 2016 at 11:41 AM, shubham.gupta <shubham.gu...@orkash.com>
wrote:

> my current regex-urlfilter properties are as follows:
>
> # skip file: ftp: and mailto: urls
> #-^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> # for a more extensive coverage use the urlfilter-suffix plugin
> #-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|pdf|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>
> # skip URLs containing certain characters as probable queries, etc.
> #-[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
> # accept anything else
> -^(http://up.anv.bz)
> +.
>
> # skip URLs longer than 512 characters
> -^.{513,}$
>
> Thanks and Regards,
> Shubham Gupta
>
> On Wednesday 05 October 2016 11:29 AM, Sachin Shaju wrote:
>
>> my regex-urlfilter properties are as follows:
>> >>>>
>> >>>># skip file: ftp: and mailto: urls
>> >>>>-^(file|ftp|mailto):
>> >>>>
>> >>>># skip image and other suffixes we can't yet parse
>> >>>># for a more extensive coverage use the urlfilter-suffix plugin
>> >>>>-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|
>> >>>>wmf|WMF|zip|ZIP|ppt|pdf|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|
>> >>>>tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>> >>>>
>> >>>># skip URLs containing certain characters as probable queries, etc.
>> >>>>#-[?*!@=]
>> >>>>
>> >>>># skip URLs with slash-delimited segment that repeats 3+ times, to
>> break
>> >>>>loops
>> >>>>-.*(/[^/]+)/[^/]+\1/[^/]+\1/
>> >>>>
>> >>>># accept anything else
>> >>>>-^(http://up.anv.bz)
>> >>>>+.
>> >>>>
>> >>>># skip URLs longer than 512 characters
>> >>>>-^.{513,}$
>>
>
>


Re: 90% of URL rejected by filtering (Nutch 2.3.1)

2016-10-05 Thread Sachin Shaju
Hi,
Can you share your current regex-urlfilter file?

Regards,
Sachin Shaju

sachi...@mstack.com
+919539887554

On Wed, Oct 5, 2016 at 11:19 AM, shubham.gupta <shubham.gu...@orkash.com>
wrote:

> The problem is not yet solved.
>
> Thanks and Regards
> Shubham Gupta
>
> On Monday 03 October 2016 11:12 AM, shubham.gupta wrote:
>
>> After doing this, 3 fewer URLs have been rejected.
>>
>> Thanks and Regards,
>> Shubham Gupta
>>
>> On Monday 03 October 2016 10:28 AM, Sachin Shaju wrote:
>>
>>> You could comment out all the regex filters in the url-filter file and
>>> keep only +. to see whether it gives the same output.
>>>
>>> Regards,
>>> Sachin Shaju
>>>
>>> sachi...@mstack.com
>>>
>>> On Mon, Oct 3, 2016 at 10:05 AM, shubham.gupta <shubham.gu...@orkash.com>
>>> wrote:
>>>
>>> Hey
>>>>
>>>> When the inject job is run, 90% of my seed URLs get rejected. As a result,
>>>> very few URLs get crawled and the crawl does not give proper output.
>>>>
>>>> my regex-urlfilter properties are as follows:
>>>>
>>>> # skip file: ftp: and mailto: urls
>>>> -^(file|ftp|mailto):
>>>>
>>>> # skip image and other suffixes we can't yet parse
>>>> # for a more extensive coverage use the urlfilter-suffix plugin
>>>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|pdf|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>>>>
>>>> # skip URLs containing certain characters as probable queries, etc.
>>>> #-[?*!@=]
>>>>
>>>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>>>> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>>>
>>>> # accept anything else
>>>> -^(http://up.anv.bz)
>>>> +.
>>>>
>>>> # skip URLs longer than 512 characters
>>>> -^.{513,}$
>>>>
>>>> --
>>>>
>>>> Shubham Gupta
>>>>
>>>>
>>>>
>>
>


Nutch as a service

2016-10-04 Thread Sachin Shaju
Hi,
I would like to know how the Nutch server actually works. Does it use a
listener for incoming crawl requests, or is it a continuously running
server?
Regards,
Sachin Shaju

sachi...@mstack.com


Re: Nutch in production

2016-09-29 Thread Sachin Shaju
Could you share a link to this?

Regards,
Sachin Shaju

sachi...@mstack.com
+919539887554

On Thu, Sep 29, 2016 at 11:13 PM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Yep also check out the work that Sujen Shah just merged (also on my team
> at JPL and USC) where you can publish events to an ActiveMQ queue from
> Nutch crawling. That should allow all sorts of production dashboards and
> analytics.
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect, Instrument Software and Science Data Systems Section (398)
> Manager, Open Source Projects Formulation and Development Office (8212)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++
>
>
> On 9/29/16, 10:41 AM, "Karanjeet Singh" <karan...@usc.edu> wrote:
>
> Hi Sachin,
>
> Just a suggestion here - you can use Apache Kafka to generate and catch
> events which are mapped to incoming crawl requests, crawl status and
> much more.
>
> I have created a prototype for production queue [0] which runs on top
> of a supercomputer (TACC Wrangler) and integrated it with Kafka. Please
> have a look and let me know if you have any questions.
>
> [0]: https://github.com/karanjeets/PCF-Nutch-on-Wrangler
>
> P.S. - There can be many solutions to this. I am just giving one.  :)
>
> Regards,
> Karanjeet Singh
> http://irds.usc.edu
>
> On Thu, Sep 29, 2016 at 1:33 AM, Sachin Shaju <sachi...@mstack.com>
> wrote:
>
> > Hi,
> > I was experimenting some crawl cycles with nutch and would like to
> > setup a distributed crawl environment. But I wonder how can I trigger
> > nutch for incoming crawl requests in a production system. I read about
> > nutch REST api. Is that the real option that I have ? Or can I run
> > nutch as a continuously running distributed server by any other option ?
> >
> >  My preferred nutch version is nutch 1.12.
> >
> > Regards,
> > Sachin Shaju
> >
> > sachi...@mstack.com
> > +919539887554
> >
>


Custom options in nutch crawl script

2016-09-29 Thread Sachin Shaju
I was trying to pass custom options to the *bin/crawl* script and ran into
an issue. I passed a custom config to ignore external outlinks in my crawl
command, like this:

bin/crawl -i -D elastic.index=test -D db.ignore.external.links=true urls/ CrawlTest/ 3

But this does not work. When I set the same property in nutch-site.xml, it
works.

Then I tried passing a custom config as a Java option to bin/crawl to index
data to a specific elastic index other than the one given in nutch-site.xml.
To my surprise, that works.
The command I used:

bin/crawl -i -D elastic.index=test urls/ CrawlTest/ 3

So I would like to know why my first command didn't work. Am I missing
anything? Please help.
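
For reference, the setup that did work, with the property set in
nutch-site.xml, would look roughly like the XML embedded in the sketch below.
The helper site_property is hypothetical; it only checks that a property is
present in a Hadoop-style configuration file and says nothing about why the
-D form is not picked up.

```python
import xml.etree.ElementTree as ET

# Assumed minimal nutch-site.xml content, for illustration only.
SITE_XML = """<configuration>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
</configuration>"""


def site_property(xml_text, name):
    """Return the value of a named property, or None if it is not set."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


print(site_property(SITE_XML, "db.ignore.external.links"))
```

A check like this can confirm which values come from the site file before
comparing them with what is passed on the command line.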

Regards,
Sachin Shaju

sachi...@mstack.com
+919539887554


Nutch in production

2016-09-29 Thread Sachin Shaju
Hi,
   I was experimenting with some crawl cycles in Nutch and would like to set
up a distributed crawl environment. But I wonder how I can trigger Nutch for
incoming crawl requests in a production system. I read about the Nutch REST
API. Is that the real option I have? Or can I run Nutch as a continuously
running distributed server in some other way?

My preferred Nutch version is 1.12.

Regards,
Sachin Shaju

sachi...@mstack.com
+919539887554


How to run nutch server on distributed environment

2016-09-29 Thread Sachin Shaju
Hi,

I have tested running Nutch in server mode by starting it *locally* with the
bin/nutch startserver command. Now I wonder whether I can start Nutch in
*server mode* on top of a Hadoop cluster (in a distributed environment) and
submit crawl requests to the server using the Nutch REST API.
Please help.

Regards,
Sachin Shaju

sachi...@mstack.com
+919539887554
