RE: nutch 1.12 How can I force a URL to get re-indexed

2016-10-07 Thread Markus Jelsma
Hello - this sounds fine indeed. But I don't know what happens to the 
calculation of the next fetch time when -adddays is used; I've never tried it. 
You may want to confirm that the fetch time is not affected by this.
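
One way to verify this is to dump the URL's CrawlDb record before and after the 
cycle and compare the fetch time and interval; a minimal sketch using the 
standard readdb tool, with the crawldb path and URL reused from Sujan's example 
below:

# Print the CrawlDatum (status, fetch time, fetch interval, metadata) for one URL
$ bin/nutch readdb examplesite/crawldb -url http://localhost:9090/nutchsite/html/page1.html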

Markus


-Original message-
> From: Sujan Suppala
> Sent: Friday 7th October 2016 14:11
> To: user@nutch.apache.org
> Subject: RE: nutch 1.12 How can I force a URL to get re-indexed
> 
> Thanks Markus.
> 
> I cannot use freegen, as this tool is not available via the REST API.
> 
> I achieved my requirement with a combination of the generator's -adddays and 
> -expr options.
> Here is what I did:
> 1. Inject the URLs with some metadata, say pageId=
>   The seed file contains the entry below:
>   http://localhost:9090/nutchsite/html/page1.html pageId=
> 
> 2. Now issue the generate command with the -adddays option (to make all the 
> URLs due for fetching) and the -expr option (to filter the URLs), so that only 
> the URLs to be fetched again are selected, as below:
>   $ bin/nutch generate examplesite/crawldb examplesite/segments -expr "(pageId == '')" -adddays 30
>   
> Please comment if you see any issues with this approach.
> 
> Thanks
> Sujan
> 
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
> Sent: Thursday, October 06, 2016 7:32 PM
> To: user@nutch.apache.org
> Subject: RE: nutch 1.12 How can I force a URL to get re-indexed
> 
> Hi
> 
> You can use -adddays N in the generator job to fool it, or just use a lower 
> interval. Or, use the freegen tool to immediately crawl a set of URLs.
> 
> Markus
> 
>  
>  
> -Original message-
> > From: Sujan Suppala
> > Sent: Thursday 6th October 2016 15:56
> > To: user@nutch.apache.org
> > Subject: nutch 1.12 How can I force a URL to get re-indexed
> > 
> > Hi,
> > 
> > By default, Nutch fetches a URL based on the already-set next fetch 
> > interval (30 days). If the page is updated before this interval has 
> > elapsed, how can I force it to be re-indexed?
> > 
> > How can I just 're-inject' the URLs to set the next fetch date to 
> > 'immediately'?
> > 
> > FYI, I am using the Nutch REST API client to index the URLs.
> > 
> > Thanks
> > Sujan
> > 
> 


RE: nutch 1.12 How can I force a URL to get re-indexed

2016-10-07 Thread Sujan Suppala
Thanks Markus.

I cannot use freegen, as this tool is not available via the REST API.

I achieved my requirement with a combination of the generator's -adddays and 
-expr options.
Here is what I did:
1. Inject the URLs with some metadata, say pageId=
The seed file contains the entry below:
http://localhost:9090/nutchsite/html/page1.html pageId=

2. Now issue the generate command with the -adddays option (to make all the URLs 
due for fetching) and the -expr option (to filter the URLs), so that only the 
URLs to be fetched again are selected, as below:
$ bin/nutch generate examplesite/crawldb examplesite/segments -expr "(pageId == '')" -adddays 30

Please comment if you see any issues with this approach.

Thanks
Sujan

-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Thursday, October 06, 2016 7:32 PM
To: user@nutch.apache.org
Subject: RE: nutch 1.12 How can I force a URL to get re-indexed

Hi

You can use -adddays N in the generator job to fool it, or just use a lower 
interval. Or, use the freegen tool to immediately crawl a set of URLs.

Markus
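
A minimal sketch of the two approaches Markus mentions; the urls-to-refetch/ 
directory is a hypothetical seed list, and the crawldb/segments paths reuse the 
layout from Sujan's example:

# Option 1: bypass the CrawlDb schedule entirely - freegen builds a fetchable
# segment directly from a plain text list of URLs.
$ bin/nutch freegen urls-to-refetch/ examplesite/segments

# Option 2: pretend the clock has advanced 30 days so the due-for-fetch check passes.
$ bin/nutch generate examplesite/crawldb examplesite/segments -adddays 30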

 
 
-Original message-
> From: Sujan Suppala
> Sent: Thursday 6th October 2016 15:56
> To: user@nutch.apache.org
> Subject: nutch 1.12 How can I force a URL to get re-indexed
> 
> Hi,
> 
> By default, Nutch fetches a URL based on the already-set next fetch interval 
> (30 days). If the page is updated before this interval has elapsed, how can I 
> force it to be re-indexed?
> 
> How can I just 're-inject' the URLs to set the next fetch date to 
> 'immediately'?
> 
> FYI, I am using the Nutch REST API client to index the URLs.
> 
> Thanks
> Sujan
> 


Re: Nutch as a service

2016-10-07 Thread Sachin Shaju
Hi Furkan,
 I've checked passing null for args. It didn't work either.
After investigating the source code of *Fetcher.java*, I figured out that it
looks for the segment on the local path if no segment option is provided. If
the segment option points to a valid segment in HDFS, it works. I resolved that
issue by returning the segment path from the generate phase in the results JSON
of the generate REST call. I added one or two lines to the source code of
*Generator.java* and it works. I am not sure this is the right way to do it,
but it works. Please write to me if there is a better option.
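
In other words, the FETCH request has to name the segment explicitly. A sketch of 
what such a request might look like, reusing the confId/crawlId from the failing 
example quoted below; the port assumes the server default, and the "segment" 
argument key and the segment path are assumptions (use whatever path the generate 
step actually returned):

# Hypothetical FETCH request that passes the segment explicitly
$ curl -X POST http://localhost:8081/job/create \
    -H 'Content-Type: application/json' \
    -d '{"type":"FETCH","confId":"news","crawlId":"crawl001","args":{"segment":"crawl001/segments/20161005185359"}}'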

Everything works until the index phase. Indexing to Elasticsearch fails with an
unknown exception. Please have a look at
http://www.mail-archive.com/user%40nutch.apache.org/msg15001.html

Regards,
Sachin Shaju

sachi...@mstack.com

On Thu, Oct 6, 2016 at 10:12 PM, Furkan KAMACI 
wrote:

> Hi Sachin,
>
> Could you check it again with sending *null* instead of *{}* ?
>
> Kind Regards,
> Furkan KAMACI
>
> On Thu, Oct 6, 2016 at 7:20 AM, Sachin Shaju  wrote:
>
> > Hi Sujen,
> >   Thanks for the reply. Actually, that Stack Overflow post was
> > created by me. :) I have some more queries.
> >  1. Do I have to run the server on the Hadoop namenode itself?
> >  2. I have tested the Nutch server on Hadoop, but in the *fetch phase* it
> > encounters a *NullPointerException*, which I can post here.
> > 16/10/05 18:53:59 ERROR impl.JobWorker: Cannot run job worker!
> >
> > java.lang.NullPointerException
> > at java.util.Arrays.sort(Arrays.java:1438)
> > at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:564)
> > at org.apache.nutch.service.impl.JobWorker.run(JobWorker.java:71)
> > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> > at java.lang.Thread.run(Thread.java:745)
> >
> > I've checked the source code. It is due to the absence of a segment parameter
> > in the REST call for fetch. I expected it to pick the latest segment
> > automatically, but it does not work that way.
> >
> > The request I've used is :-
> >
> > *POST /job/create*
> > *{   *
> > *"type":"FETCH",*
> > *"confId":"news",*
> > *"crawlId":"crawl001",*
> > *"args": {}*
> > *}*
> >
> > Am I missing anything here ?
> >
> >
> >
> >
> > Regards,
> > Sachin Shaju
> >
> > sachi...@mstack.com
> > +919539887554
> >
> > On Thu, Oct 6, 2016 at 5:03 AM, Sujen Shah  wrote:
> >
> > > Hi Sachin,
> > >
> > > The Nutch REST API is built using the Apache CXF framework and JAX-RS. The
> > > Nutch server uses an embedded Jetty server to service the HTTP requests.
> > > You can find out more about CXF and Jetty here
> > > (http://cxf.apache.org/docs/overview.html).
> > >
> > > The server runs on one machine waiting for HTTP requests. Once a request
> > > is received, it starts the respective Nutch job (which might be
> > > distributed, e.g. the fetch job).
> > >
> > >
> > > Just for visibility on the user list, this question was asked on
> > > Stack Overflow. A link to the question and the follow-up discussion can be
> > > found at
> > > http://stackoverflow.com/questions/39853492/working-of-nutch-server-in-distributed-mode
> > >
> > > Thanks
> > > Sujen
> > >
> > >
> > >
> > > Regards,
> > > Sujen Shah
> > > M.S - Computer Science
> > > University of Southern California
> > > http://www.linkedin.com/in/sujenshah
> > >
> > > On Tue, Oct 4, 2016 at 6:18 AM, Sachin Shaju wrote:
> > >
> > > > Hi,
> > > > I would like to know how the Nutch server actually works. Does it use
> > > > a listener for incoming crawl requests, or is it a continuously running
> > > > server?
> > > > Regards,
> > > > Sachin Shaju
> > > >
> > > > sachi...@mstack.com
> > > >

Unknown issue in Nutch indexer with REST api

2016-10-07 Thread Sachin Shaju
Hi,
I was trying to expose Nutch via its REST endpoints and ran into an issue
in the indexer phase. I'm using the Elasticsearch index writer to index docs to
ES. I used the $NUTCH_HOME/runtime/deploy/bin/nutch startserver command. While
indexing, an unknown exception is thrown.

Error:
com.google.common.util.concurrent.MoreExecutors.directExecutor()Ljava/util/concurrent/Executor;
16/10/07 16:01:47 INFO mapreduce.Job:  map 100% reduce 0%
16/10/07 16:01:49 INFO mapreduce.Job: Task Id :
attempt_1475748314769_0107_r_00_1, Status : FAILED
Error:
com.google.common.util.concurrent.MoreExecutors.directExecutor()Ljava/util/concurrent/Executor;
16/10/07 16:01:53 INFO mapreduce.Job: Task Id :
attempt_1475748314769_0107_r_00_2, Status : FAILED
Error:
com.google.common.util.concurrent.MoreExecutors.directExecutor()Ljava/util/concurrent/Executor;
16/10/07 16:01:58 INFO mapreduce.Job:  map 100% reduce 100%
16/10/07 16:01:59 INFO mapreduce.Job: Job job_1475748314769_0107 failed
with state FAILED due to: Task failed task_1475748314769_0107_r_00
Job failed as tasks failed. failedMaps:0 failedReduces:1

ERROR indexer.IndexingJob: Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)

Failed with exit code 255.

Any help would be appreciated.

PS: After debugging with the stack trace, I think the issue is due to a mismatch
in the Guava version. I've tried changing the build.xml of the plugins (parse-tika
and parsefilter-naivebayes), but it didn't work.
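
That error signature points at MoreExecutors.directExecutor(), which only exists
from Guava 18.0 onwards, so an older Guava earlier on the task classpath (Hadoop
ships one) would explain it. A quick way to check which versions are in play; the
job file name and Hadoop lib path below are assumptions about a typical Nutch 1.12
deploy build and Hadoop 2.x layout:

# List the Guava jars bundled into the Nutch job file and the ones shipped with Hadoop
$ unzip -l runtime/deploy/apache-nutch-1.12.job | grep -i guava
$ ls $HADOOP_HOME/share/hadoop/common/lib | grep -i guava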


Regards,
Sachin Shaju

sachi...@mstack.com



Re: Issue Crawling Alternate URLs

2016-10-07 Thread Sebastian Nagel
Hi Matthew,

afaics, the content delivered to Nutch under the URL

  http://rssfeeds.azcentral.com/phoenix/asu

does not contain the link

  http://rssfeeds.azcentral.com/phoenix/asu=1

That's the simple answer. What you see in a browser is often not what is 
delivered from the server to a spider. I've tested with both Nutch and wget, see 
below.

Best,
Sebastian


% bin/nutch plugin protocol-http org.apache.nutch.protocol.http.Http \
 -verbose http://rssfeeds.azcentral.com/phoenix/asu
Status: success(1), lastModified=0
Content Type: application/rss+xml
Content Length: null
Content:

<?xml version="1.0"?>
<?xml-stylesheet href="http://rssfeeds.azcentral.com/feedblitz_rss.xslt"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" version="2.0"
xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
  <channel>
    <title>Phoenix - ASU</title>
    <link>http://api-internal.usatoday.com.akadns.net</link>
...

% wget -O azcentral.asu.wget.xml http://rssfeeds.azcentral.com/phoenix/asu
--2016-10-07 09:32:21--  http://rssfeeds.azcentral.com/phoenix/asu
Resolving rssfeeds.azcentral.com (rssfeeds.azcentral.com)... 198.251.67.124, 
198.251.67.127,
198.71.59.197, ...
Connecting to rssfeeds.azcentral.com 
(rssfeeds.azcentral.com)|198.251.67.124|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/xml]
Saving to: ‘azcentral.asu.wget.xml’

azcentral.asu.wget.xml      [ <=>              ] 136.25K  --.-KB/s    in 0.01s

2016-10-07 09:32:23 (11.6 MB/s) - ‘azcentral.asu.wget.xml’ saved [139517]

% grep -F 'http://rssfeeds.azcentral.com/phoenix/asu=1' azcentral.asu.wget.xml

(nothing found)
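
Another way to see exactly which outlinks Nutch extracts from that page is the
standard parsechecker tool; this is an additional suggestion, not part of the
test above:

# Fetch and parse the page the way Nutch does, then dump the parse text,
# metadata and extracted outlinks to the console.
$ bin/nutch parsechecker -dumpText http://rssfeeds.azcentral.com/phoenix/asu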


On 10/06/2016 05:37 PM, Adler, Matthew (US) wrote:
> Hi Sebastian:
> 
> You are correct in terms of the first URL, which isn't my issue.  The issue 
> is that when I am attempting to crawl that initial page, 
> http://rssfeeds.azcentral.com/phoenix/asu, I want Nutch to find the RSS page 
> linked from it, which is this one:
> 
> http://rssfeeds.azcentral.com/phoenix/asu=1
> 
> The issue, though, is that Nutch can't seem to find that link.  From what I can 
> tell, the reason is the structure of the link tag, which is:
> 
> <link ... href="http://rssfeeds.azcentral.com/phoenix/asu=1" title="Phoenix - ASU">
> 
> Please let me know if this clarifies the issue.
> 
> Cheers,
> MA
> 
> -Original Message-
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> Sent: Thursday, October 06, 2016 8:26 AM
> To: user@nutch.apache.org
> Subject: Re: Issue Crawling Alternate URLs
> 
> Hi,
> 
>> http://rssfeeds.azcentral.com/phoenix/asu
> 
> That's already an RSS feed which unluckily fails to parse:
> (using plugin "feed")
>  Status: failed(2,200): com.sun.syndication.io.ParsingFeedException: Invalid 
> XML: Error on line 183:
> XML document structures must start and end within the same entity.
> (using "parse-tika")
>  Caused by: com.rometools.rome.io.ParsingFeedException: Invalid XML: Error on 
> line 188: XML document structures must start and end within the same entity.
> 
> 
> When opening the URL in a browser (Firefox), the server sends an HTML page.
> At least, that's what I got when trying it:
> 
> % wget -q -O - http://rssfeeds.azcentral.com/phoenix/asu | head
> <?xml version="1.0"?>
> <?xml-stylesheet href="http://rssfeeds.azcentral.com/feedblitz_rss.xslt"?>
> <rss xmlns:content="http://purl.org/rss/1.0/modules/content/" version="2.0"
> xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
>   <channel>
>     <title>Phoenix - ASU</title>
>     <link>http://api-internal.usatoday.com.akadns.net</link>
>     <description>Phoenix - ASU</description>
>     <copyright>Copyright 2016, GANNETT</copyright>
>     <language>en-us</language>
> 
> http://www.azcentral.com/story/sports/ncaaf/asu/2016/10/05/arizona-state-football-needs-reignite-run-game-against-ucla/91631636/
> 
> 
> Best,
> Sebastian
> 
> On 10/05/2016 02:08 PM, Adler, Matthew (US) wrote:
>> Hello Nutch Users:
>>
>> I’m currently having an issue with Nutch 1.4, similar to the one logged here:
>>
>> https://issues.apache.org/jira/browse/NUTCH-2319
>>
>> Using the example in that JIRA issue, if I am on the following URL:
>> http://rssfeeds.azcentral.com/phoenix/asu
>>
>> I expect that Nutch will be able to find the alternate linked URL, specified 
>> in the following link tag:
>>
>> <link ... href="http://rssfeeds.azcentral.com/phoenix/asux=1"
>> title="Phoenix - ASU">
>>
>> It does not, however, even though I’ve tried making a few changes to the 
>> regexes in suffix-urlfilter.txt, regex-normalize.xml, regex-urlfilter.txt, 
>> and prefix-urlfilter.txt, but have not had any success.
>>
>> Any feedback would be appreciated.
>>
>> Please let me know,
>>
>> MA
>>
> 