Date missing from Solr, even though in HTTP last-modified

2016-10-18 Thread Tom Chiverton
I have "index-(basic|anchor|more|metadata)" and 
"parse-(html|tika|metatags)" included in plugin.includes, but despite:



# bin/nutch parsechecker https:/. |grep -i date
Date :  Tue, 18 Oct 2016 14:37:40 GMT
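
For comparison, bin/nutch indexchecker shows what the indexing filters actually emit for the same URL (a sketch; <url> is a placeholder):

# bin/nutch indexchecker <url> | grep -i date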


The 'date' field in Solr for the document is wrong:

"date": "1970-01-01T00:00:00Z",

That's the Unix epoch, i.e. an unset timestamp.


Why is this? Also, since 'date' seems to be inferred from the 
'Last-Modified' header, I'd like it to go into 'lastModified' too...


I saw some reference to setting solrindex-mapping.xml
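
The mapping in question would be along these lines (a sketch; the source field name is an assumption):

<fields>
  <field dest="lastModified" source="date"/>
</fields>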

but this dies during IndexingJob with
Caused by: 
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: 
ERROR: [doc=com.abloz:http/hbase/book.html] multiple values encountered 
for non multiValued field lastModified: [Tue Jun 16 10:55:02 UTC 2015, 
Tue Jun 16 10:55:02 UTC 2015]


which makes no sense; there aren't two Last-Modified HTTP headers. 
(The two identical values suggest two indexing plugins each adding the 
field once, rather than a duplicated header.) It does at least confirm 
the value is going in...


The Solr schema is correct, I think (there's no real-world reason for 
lastModified to be multi-valued!).
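
That is, a field definition along these lines (a sketch, assuming the stock Solr date type):

<field name="lastModified" type="date" indexed="true" stored="true" multiValued="false"/>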

 






Re: Trouble fetch PDFs to pass to Tika (I think)

2016-10-18 Thread Tom Chiverton

That's only in nutch-default.xml, and is set to the default which is true.

Good idea though !

Tom


On 17/10/16 17:27, Julien Nioche wrote:

Hi Tom

You haven't modified the value for the config below by any chance?

<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
  <description>Some servers return HTTP status 403 (Forbidden) if
  /robots.txt doesn't exist. This should probably mean that we are
  allowed to crawl the site nonetheless. If this is set to false,
  then such sites will be treated as forbidden.</description>
</property>




The default value (true) should work fine.

Julien


On 17 October 2016 at 16:38, Tom Chiverton wrote:


A site I am trying to index has its HTML content on one domain,
and some linked PDFs on another (an Amazon S3 bucket).


So I have set up my plugin.includes in nutch-site.xml:



<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|index-(basic|anchor|more|metadata)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)</value>
</property>


and made sure regex-urlfilter.txt is OK with it all.
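
For the S3 domain that means an accept rule along these lines (a sketch; the pattern is illustrative):

+^https?://s3-eu-west-1\.amazonaws\.com/

with a similar line for the primary domain.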


But I observe some oddness during fetching, and can't locate the
PDFs in the Solr collection.

All the content on the PDF domain flies past with no pause:

-finishing thread FetcherThread8, activeThreads=0
-finishing thread FetcherThread9, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching https://s3-eu-west-1.amazonaws.com/ (queue crawl delay=5000ms)

and then it hits the primary domain and starts pausing between each request.

Turning the fetcher's log level up to DEBUG, I see

DEBUG fetcher.FetcherJob - Denied by robots.txt: https://s3-eu-west-1.

but there is no robots.txt in the root of the Amazon S3 URL -
https://s3-eu-west-1.amazonaws.com/robots.txt is a 403!
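
That's easy to reproduce with plain curl (standard tooling; a sketch):

# curl -sI https://s3-eu-west-1.amazonaws.com/robots.txt | head -n 1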

Any ideas what could be up?





--
Open Source Solutions for Text Engineering
http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble





Re: Nutch in production

2016-10-18 Thread lewis john mcgibbney
Hi Sachin,
Answering both of your questions here as I am catching up with some mail.

On Fri, Sep 30, 2016 at 5:04 AM,  wrote:

>
> From: Sachin Shaju 
> To: user@nutch.apache.org
> Cc:
> Date: Fri, 30 Sep 2016 10:00:04 +0530
> Subject: Re: Nutch in production
> Thank you guys for your replies. I will look into the suggestions you gave.
> But I have one more query. How can I trigger nutch from a queue system in a
> distributed environment?


Well, this is a bit more tricky of course. As per my other mailing list
thread, you can easily use the REST API and the NutchServer for publishing
Nutch workflows, so I would advise you to look into that.
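
For example, injecting via the REST API looks roughly like this (endpoint and
JSON shape as documented on the Nutch_1.X_RESTAPI wiki page; host, port, IDs
and paths here are illustrative):

# curl -X POST -H 'Content-Type: application/json' \
    http://localhost:8081/job/create \
    -d '{"crawlId":"crawl01","type":"INJECT","confId":"default","args":{"url_dir":"/path/to/seeds"}}'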


> Can the REST api be a real option in distributed mode?


As per my other thread... yes :) The one limitation is getting the injected
URLs into HDFS for use within the rest of the workflow.


> Or will I have to go for a command line invocation for nutch?
>
>
I think that we need to provide a patch for Nutch trunk to enable ingestion
of the injected seeds into HDFS via the REST API. Right now this
functionality is lacking. I've created a ticket for it at
https://issues.apache.org/jira/browse/NUTCH-2327

We will try to address this before the pending Nutch 1.13 release, however I
cannot promise anything.
Thanks
Lewis


Re: How to run nutch server on distributed environment

2016-10-18 Thread lewis john mcgibbney
Hi Sachin,
Very late response, I know, but hopefully better late than never. Response
below.

On Fri, Sep 30, 2016 at 5:04 AM,  wrote:

>
> From: Sachin Shaju 
> To: user@nutch.apache.org
> Cc:
> Date: Thu, 29 Sep 2016 14:01:13 +0530
> Subject: How to run nutch server on distributed environment
> Hi,
>
> I have tested running nutch in server mode by starting it using the
> bin/nutch startserver command *locally*. Now I wonder whether I can start
> nutch in *server mode* on top of a hadoop cluster (in a distributed
> environment) and submit crawl requests to the server using the nutch REST api?
> Please help.
>
>
I am assuming you are running the Nutch master branch (as the command is
'startserver').
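For reference, that is started with (a sketch; 8081 is the default port):

# bin/nutch startserver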
The answer is yes: as long as your YARN cluster is running well and your
memory settings are well suited to your crawl datasets, you will be good. If
I were you I would spend a bit of time running test crawls with various fetch
list and batch sizes, ensuring that you have no memory issues and that your
containers are not killed by the ApplicationMaster.

On the Nutch side, please note that right now, when you POST a seed list it
is cached in /var/something/something on the server running the NutchServer,
NOT on HDFS, meaning that you somehow need to get it onto HDFS before you can
use it with the INJECT url_dir parameter.
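
One way to do that by hand is ordinary HDFS tooling, e.g. (a sketch; paths are illustrative):

# hdfs dfs -mkdir -p /user/nutch/seeds
# hdfs dfs -put /path/to/local/seeds.txt /user/nutch/seeds/

and then pass /user/nutch/seeds as url_dir to the INJECT job.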

If you need any help with this then simply consult the very helpful
documentation put together by Sujen at
https://wiki.apache.org/nutch/Nutch_1.X_RESTAPI
Let us know how you get on, as the REST API is very handy indeed. It would be
nice to build it into deployment managers such as Ambari in the future.

Lewis