Unable to crawl a URL unless session cookies are set

2014-12-02 Thread Krishnanand, Kartik
Hi, I hope that someone can help. I am crawling an internal site. When I load the URL that I want to crawl in the browser, it does a 301 redirect to another URL that sets cookies that expire at the end of the session. When I load the URL again in the browser, I am now able

ERROR: [doc=http://nutch.apache.org/] unknown field 'metatag.keywords'

2014-12-02 Thread arthur.hk.c...@gmail.com
Hi, I am new to Nutch and Solr, please help!! I am using Nutch 1.9, Solr 4.10.2 and Hadoop 2.4.1. It always returns "org.apache.solr.common.SolrException: Bad Request" (I have already copied [nutch]conf/schema.xml to [solr]/collection1/conf/schema.xml and restarted Solr). Below is about my

Re: ERROR: [doc=http://nutch.apache.org/] unknown field 'metatag.keywords'

2014-12-02 Thread Jonathan Cooper-Ellis
Hi, In solrindex-mapping.xml, try changing the values for source to metatag.keywords and metatag.description. Or change the fields Solr is expecting to metatag.keywords and metatag.description. Hope that helps! On Tue, Dec 2, 2014 at 4:52 AM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com

Re: ERROR: [doc=http://nutch.apache.org/] unknown field 'metatag.keywords'

2014-12-02 Thread arthur.hk.c...@gmail.com
Hi, Thank you!! I fixed the issue related to unknown field 'metatag.keywords'. Regards, Arthur On 2 Dec, 2014, at 10:23 pm, Jonathan Cooper-Ellis j...@ziftr.com wrote: Hi, In solrindex-mapping.xml, try changing the values for source to metatag.keywords and metatag.description. Or

org.apache.solr.common.SolrException, unknown field 'host'

2014-12-02 Thread arthur.hk.c...@gmail.com
Hi, I am new to Nutch and Solr, please help!! I am using Nutch 1.9, Solr 4.10.2 and Hadoop 2.4.1. I always get org.apache.solr.common.SolrException, unknown field 'host'. What could be wrong? The schema.xml has <field name="host" type="string" stored="false" indexed="true"/>. I have already copied

Re: Unable to crawl a URL unless session cookies are set

2014-12-02 Thread remi tassing
Hi Kartik, I had a similar enquiry a long time ago and from what I remember, Nutch will save the new URL and crawl it in the future...which is not the needed behavior here. To solve this problem, I've customized my protocol-httpclient (HttpResponse class) to just open the 2nd URL right after the
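
The idea described in this thread (request the URL, follow the 301 immediately so the session cookies get set, then request the original URL again with those cookies) can be sketched in plain Java. This is only an illustration under stated assumptions and is not the protocol-httpclient change mentioned above: the class `SessionRedirectSketch` and its methods are hypothetical names, it uses `java.net.HttpURLConnection` rather than any Nutch class, and it ignores cookie domain, path, and expiry rules.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not part of Nutch or protocol-httpclient.
final class SessionRedirectSketch {

    /** Builds a Cookie request header from Set-Cookie response header values. */
    static String cookieHeader(List<String> setCookieValues) {
        List<String> pairs = new ArrayList<>();
        for (String value : setCookieValues) {
            // Keep only "name=value"; drop attributes such as Path or HttpOnly.
            String pair = value.split(";", 2)[0].trim();
            if (!pair.isEmpty()) {
                pairs.add(pair);
            }
        }
        return String.join("; ", pairs);
    }

    /** Collects all Set-Cookie values of a response, ignoring header-name case. */
    static List<String> setCookieValues(HttpURLConnection conn) {
        List<String> values = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : conn.getHeaderFields().entrySet()) {
            // The status line is stored under a null key, so guard against it.
            if (e.getKey() != null && e.getKey().equalsIgnoreCase("Set-Cookie")) {
                values.addAll(e.getValue());
            }
        }
        return values;
    }

    /** Opens a connection that does not follow redirects on its own. */
    static HttpURLConnection open(URI uri, String cookies) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
        conn.setInstanceFollowRedirects(false);
        if (!cookies.isEmpty()) {
            conn.setRequestProperty("Cookie", cookies);
        }
        return conn;
    }

    /** Returns the status of the original URL after the redirect set up a session. */
    static int fetchWithSession(String startUrl) throws IOException {
        URI start = URI.create(startUrl);
        HttpURLConnection first = open(start, "");
        int status = first.getResponseCode();
        String location = first.getHeaderField("Location");
        if (status != HttpURLConnection.HTTP_MOVED_PERM || location == null) {
            return status; // no 301, nothing special to do
        }
        // Open the second URL right away and remember any cookies it sets.
        List<String> cookies = setCookieValues(first);
        HttpURLConnection second = open(start.resolve(location), cookieHeader(cookies));
        second.getResponseCode(); // forces the request to be sent
        cookies.addAll(setCookieValues(second));
        // Retry the original URL, now carrying the session cookies.
        return open(start, cookieHeader(cookies)).getResponseCode();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java SessionRedirectSketch <url>");
            return;
        }
        System.out.println("final status: " + fetchWithSession(args[0]));
    }
}
```

A real crawler plugin would also need to respect cookie scope and limit redirect chains; the sketch only shows the order of requests.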