Re: multiple values encountered for non multiValued field keywords

2019-07-17 Thread Ryan Suarez
Hi Sebastian,

I've got it working now. I had to remove the following entries in schema.xml:







...and set multiValued on the fields in managed-schema using the Schema API (no restart required):

# curl -X POST -H 'Content-type:application/json' --data-binary \
  '{"replace-field":{"name":"keywords","type":"text_general","stored":"true","indexed":"true","multiValued":"true"}}' \
  http://localhost:8983/solr/nutch/schema
# curl -X POST -H 'Content-type:application/json' --data-binary \
  '{"replace-field":{"name":"description","type":"text_general","stored":"true","indexed":"true","multiValued":"true"}}' \
  http://localhost:8983/solr/nutch/schema
# curl -X POST -H 'Content-type:application/json' --data-binary \
  '{"replace-dynamic-field":{"name":"*_str","type":"string","stored":"true","indexed":"true","multiValued":"true"}}' \
  http://localhost:8983/solr/nutch/schema
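
The replace-field requests above all share one shape, so the JSON body can be generated rather than typed by hand. A minimal sketch, assuming the same core URL as in the commands above; replace_field_payload is a hypothetical helper name, and it emits JSON booleans where the commands above used the strings "true":

```python
import json

# Core URL copied from the curl commands above.
SCHEMA_URL = "http://localhost:8983/solr/nutch/schema"


def replace_field_payload(name, field_type="text_general", multi_valued=True):
    """Build the JSON body for a Schema API 'replace-field' command."""
    return json.dumps({
        "replace-field": {
            "name": name,
            "type": field_type,
            "stored": True,
            "indexed": True,
            "multiValued": multi_valued,
        }
    })


# Print one curl command per field instead of sending anything.
for field in ("keywords", "description"):
    print("curl -X POST -H 'Content-type:application/json' "
          f"--data-binary '{replace_field_payload(field)}' {SCHEMA_URL}")
```

Printing the commands instead of sending them keeps the sketch side-effect free; the output can be reviewed before it is run against a live core.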

I'm still not sure why Solr is using keywords and description instead of 
metatag.keywords and metatag.description, but at least the values show up in 
the results. Thanks for pointing me in the right direction.

regards,
Ryan

On Wed, 2019-07-17 at 22:10 +0200, Sebastian Nagel wrote:
Hi Ryan,

It could be caused by the managed schema. Note that for Solr 7.x, updating
schema.xml alone may not be sufficient; see
  https://wiki.apache.org/nutch/NutchTutorial#Setup_Solr_for_search

Let us know whether this works. Thanks!

We'll also update the wiki page, or rather its counterpart in the new wiki:
  https://cwiki.apache.org/confluence/display/NUTCH/IndexMetatags
(the migration is ongoing)

Best,
Sebastian
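
One way to see which definition Solr is actually using, regardless of what schema.xml on disk says, is to ask the Schema API for the field. A minimal sketch, assuming the core is named nutch as elsewhere in this thread; field_url, is_multivalued and fetch_field are hypothetical helper names, and the response is assumed to carry the definition under a "field" key:

```python
import json
from urllib.request import urlopen


def field_url(base, core, field):
    """URL of the Schema API endpoint for a single field.
    showDefaults=true asks Solr to include properties inherited from the field type."""
    return f"{base}/solr/{core}/schema/fields/{field}?showDefaults=true"


def is_multivalued(schema_response):
    """schema_response: parsed JSON returned by the field endpoint."""
    return bool(schema_response.get("field", {}).get("multiValued", False))


def fetch_field(base, core, field):
    """Fetch and parse the live field definition (needs a running Solr)."""
    with urlopen(field_url(base, core, field)) as resp:
        return json.load(resp)


# Usage against a running instance (not executed here):
#   is_multivalued(fetch_field("http://localhost:8983", "nutch", "keywords"))
```

If this reports False after schema.xml was edited and Solr restarted, the core is most likely reading managed-schema rather than the edited file.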

On 7/17/19 9:52 PM, Ryan Suarez wrote:
Greetings,

I am trying to configure Nutch v1.15 and Solr v7.4.0 to index meta tags:
https://wiki.apache.org/nutch/IndexMetatags

However, I'm getting the following error:

java.lang.Exception: 
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at http://localhost:8983/solr/nutch: ERROR: 
[doc=https://mysite.domain.ca/] multiple 
values encountered for non multiValued field keywords: 
[somevalue1,somevalue2,somevalue3,etc]

I've included the following entry in 
$SOLR_HOME/server/solr/configsets/nutch/conf/schema.xml







It contains multiValued="true" and I did restart SOLR but the problem persists. 
I'm stuck at this point. Any ideas?

regards,
Ryan





multiple values encountered for non multiValued field keywords

2019-07-17 Thread Ryan Suarez
Greetings,

I am trying to configure Nutch v1.15 and Solr v7.4.0 to index meta tags:
https://wiki.apache.org/nutch/IndexMetatags

However, I'm getting the following error:

java.lang.Exception: 
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at http://localhost:8983/solr/nutch: ERROR: 
[doc=https://mysite.domain.ca/] multiple 
values encountered for non multiValued field keywords: 
[somevalue1,somevalue2,somevalue3,etc]

I've included the following entry in 
$SOLR_HOME/server/solr/configsets/nutch/conf/schema.xml







It contains multiValued="true" and I did restart SOLR but the problem persists. 
I'm stuck at this point. Any ideas?

regards,
Ryan


Re: IllegalArgumentException: No form exists: user-login-form

2019-07-09 Thread Ryan Suarez
OK, so the error message is quite clear: the page at the link you
provided has no form with an id or name of 'user-login-form'.

On Mon, 2019-07-08 at 22:39 -0400, Susheel Kumar wrote:
> Hello Sebastian,
> 
> Thanks for getting back.  Here is the Login.html link which is
> throwing the "no form exists" error.
> 
> https://www.dropbox.com/s/jkts0eogarfs03j/Log%20in%20.html?dl=0
> 
> Please take a look and suggest what could be wrong when trying to
> sign in to this site.
> 
> Also, below is the content of the auth-configuration section of
> httpclient-auth.xml:
> 
> ---
>   loginUrl="https://qa.mysite.sitecorp.com/user/login"
> loginFormId="user-login-form"
> loginRedirect="false">
>  
>   value="Crawler"/>
>   value="spid3r_us"/>
>  
>  
>   value="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3)
> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100
> Safari/537.36"
> />
>  
>  
>
>  
>  
>BROWSER_COMPATIBILITY
>  
>
> 
> 
> 
> 
> On Wed, Jul 3, 2019 at 10:22 AM Sebastian Nagel wrote:
> 
> > Hi,
> > 
> > the error message is quite clear:
> > 
> > > 2019-07-02 10:36:59,202 DEBUG httpclient.HttpFormAuthentication - No form element found with 'id' = user-login-form, trying 'name'.
> > > 2019-07-02 10:36:59,205 DEBUG httpclient.HttpFormAuthentication - No form element found with 'name' = user-login-form
> > 
> > But without access to the login page content, it's nearly
> > impossible to determine what's going wrong.
> > 
> > 
> > > I tried crawling the same url/login page using Selenium Chrome
> > > Driver and it does load and fill in the user id/pwd text boxes.
> > 
> > Sounds like the page HTML source looks different with Selenium.
> > Note that protocol-httpclient does not execute Javascript, so the
> > DOM tree is derived from the bare HTML only.  That could be a reason
> > why the form element is not found while it works in a browser
> > (emulation).
> > 
> > 
> > Best,
> > Sebastian
> > 
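
The lookup described in the DEBUG lines (first by 'id', then by 'name', on the HTML as served) can be reproduced offline against a saved copy of the login page. A minimal sketch using only the Python standard library; FormFinder and has_form are hypothetical names and not part of Nutch, and the sample pages are made up for illustration:

```python
from html.parser import HTMLParser


class FormFinder(HTMLParser):
    """Collect the id and name attributes of every <form> in static HTML."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            attr_map = dict(attrs)
            self.forms.append((attr_map.get("id"), attr_map.get("name")))


def has_form(html, form_id):
    """True if some <form> has form_id as its id or its name attribute."""
    finder = FormFinder()
    finder.feed(html)
    return any(form_id in (fid, fname) for fid, fname in finder.forms)


# Illustrative pages: one with the form in the markup, one where a script
# would have to build the form at runtime.
static_page = '<body><form id="user-login-form" action="/user/login"></form></body>'
scripted_page = '<body><div id="root"></div><script>/* form built at runtime */</script></body>'

print(has_form(static_page, "user-login-form"))
print(has_form(scripted_page, "user-login-form"))
```

No JavaScript is executed here, so a form that only appears after scripts run will not be found, which matches the behaviour Sebastian describes.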


Tracing crawled sites

2019-04-09 Thread Ryan Suarez
Greetings,

We are running nutch v1.5 with SOLR v7.3.1

I would like to determine how a specific site was crawled.  What were
the parent links that the nutch crawler followed all the way back to
the root?  

Could someone let me know what is the best way to accomplish this?

regards,
Ryan


Re: Error Updating Solr

2019-02-28 Thread Ryan Suarez
Add this to your schema.xml:


https://lucene.apache.org/solr/guide/6_6/dynamic-fields.html
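
The error quoted below says that 'metatag.description_str' is neither an explicit field nor a match for any dynamicField. Solr dynamicField names are glob-like patterns with a wildcard at the start or the end, so the question is whether any declared pattern covers that name. A minimal sketch of that matching rule; matches_dynamic_field is a hypothetical helper, and the patterns used in the checks are examples only, not a record of what any particular schema contains:

```python
def matches_dynamic_field(field_name, pattern):
    """Return True if field_name matches a dynamicField-style pattern.

    Assumes the pattern has at most one '*', at its start or at its end.
    """
    if pattern.startswith("*"):
        return field_name.endswith(pattern[1:])
    if pattern.endswith("*"):
        return field_name.startswith(pattern[:-1])
    return field_name == pattern


# A '*_str' pattern would cover the copyField destination from the error,
# while an unrelated pattern such as '*_txt' would not.
print(matches_dynamic_field("metatag.description_str", "*_str"))
print(matches_dynamic_field("metatag.description_str", "*_txt"))
```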

On Thu, 2019-02-28 at 16:45 -0600, Dave Beckstrom wrote:
I'm getting much closer to getting Nutch and SOLR to play well together.
(Ryan - thanks for your help on my last question.  Your suggestion fixed
that issue)

What is happening now is that Nutch finishes crawling, then calls the
index-writer to update solr.  The SOLR update fails with this message:

2019-02-28 17:34:33,742 WARN  mapred.LocalJobRunner -
job_local966037581_0001
java.lang.Exception:
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://127.0.0.1:8981/solr/: copyField dest
:'metatag.description_str' is not an explicit field and doesn't match a
dynamicField.

The mapping section of the index writer originally had this value:



I have removed everything from the mapping section of the index-writer and
cycled the SOLR service.  My mapping section now looks like this:


  
  
  

  


I cannot find any reference to "metatag.description_str" in any of the
nutch xml files or the SOLR files.

Any idea of how I can fix this issue?

Thank you!

--
*Fig Leaf Software, Inc.*
https://www.figleaf.com/


Full-Service Solutions Integrator








Re: Configuring Nutch to work with Solr?

2019-02-27 Thread Ryan Suarez
Try adding this to schema.xml:





On Wed, 2019-02-27 at 15:49 -0600, Dave Beckstrom wrote:

Hi Everyone,

I'm a developer and I am installing Nutch with Solr for a client.  I've
been reading everything I can get my hands on and I am just not finding the
answers to some questions.   I'm really hoping you can help!

I have Nutch 1.15 and Solr 7.3.1 installed on a Windows server.   Those
appeared to be the most current compatible versions where the Nutch
binaries were available.

The first question I have is regarding the "/conf/schema.xml"  that ships
with Nutch.  As I understand it, that file needs to be copied over to Solr
for use with the collection on Solr.

Is the schema.xml file used in the creation of the new collection on solr?
In other words, does the Nutch schema.xml file need to be copied to Solr
first and then the collection created or do you first create a collection
using the default schema.xml that ships with solr and after the collection
has been created then replace schema.xml with the nutch version of
schema.xml?

If I try and copy the nutch schema.xml over to SOLR first and then create a
collection it throws the following error:

  fieldType 'pdates' not found in the schema

Thanks!

Best,

Dave


Re: index-replace: variable substitution?

2018-10-24 Thread Ryan Suarez
Hi Yossi,

Thank you.  I finally got it to work using this configuration:


index.replace.regexp

   url:site=/https?:..([a-zA-Z0-9]+).mydomain.ca.*/$1/



cheers,
Ryan
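
The substitution in the configuration above can be tried outside Nutch. A minimal sketch in which Python's re.sub stands in for Java's Matcher.replaceAll; the pattern is copied from the url:site rule above, the group reference is written \1 in Python where Java and index-replace use $1, and site_for is a hypothetical helper name:

```python
import re

# Pattern copied from the index.replace.regexp value above. The dots are
# unescaped, so each one matches any single character.
PATTERN = re.compile(r"https?:..([a-zA-Z0-9]+).mydomain.ca.*")


def site_for(url):
    """Replace the whole URL with its first captured group (the host label)."""
    return PATTERN.sub(r"\1", url)


for url in ("https://www.mydomain.ca/test/path",
            "http://foo.mydomain.ca/some/other/path",
            "https://bar.mydomain.ca/another/example"):
    print(url, "->", site_for(url))
```

A URL that does not match the pattern comes back unchanged, so hosts outside mydomain.ca would keep the full URL as their value.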

On Sat, 2018-10-13 at 03:13 +0300, Yossi Tamari wrote:
> Hi Ryan,
> 
> From looking at the code of index-replace, it uses Java's
> Matcher.replaceAll
> <https://docs.oracle.com/javase/8/docs/api/java/util/regex/Matcher.html#replaceAll-java.lang.String->,
> so $1 (for example) should work.
> 
> Yossi.
> 
> > -----Original Message-----
> > From: Ryan Suarez 
> > Sent: 13 October 2018 01:38
> > To: user@nutch.apache.org
> > Subject: index-replace: variable substitution?
> > 
> > Greetings,
> > 
> > I'm using binaries of nutch v1.15 with solr v7.3.1, and index-replace
> > to copy a substring of the 'url' field to a new 'site' field.  Here is
> > the definition in my nutch-site.xml:
> > 
> > 
> > index.replace.regexp
> > 
> >    urlmatch=.*www.mydomain.ca.*
> >    url:site=/.*www.mydomain.ca.*/www/
> > 
> >    urlmatch=.*foo.mydomain.ca.*
> >    url:site=/.*foo.mydomain.ca.*/foo/
> > 
> >    urlmatch=.*bar.mydomain.ca.*
> >    url:site=/.*bar.mydomain.ca.*/bar/
> > 
> > 
> > 
> > This works as expected.  I am given the following site values for
> > the given url values:
> > 
> > url: https://www.mydomain.ca/test/path -> site: www
> > url: http://foo.mydomain.ca/some/other/path -> site: foo
> > url: https://bar.mydomain.ca/another/example -> site: bar
> > 
> > However, it means I have to have a definition for every host or
> > subdomain I am crawling (i.e. www, foo, bar).  Can I use variable
> > substitution in index-replace or is there another way for me to do
> > this automatically?
> > 
> > regards,
> > Ryan
> 
> 


index-replace: variable substitution?

2018-10-12 Thread Ryan Suarez
Greetings,

I'm using binaries of nutch v1.15 with solr v7.3.1, and index-replace
to copy a substring of the 'url' field to a new 'site' field.  Here is
the definition in my nutch-site.xml:


index.replace.regexp

   urlmatch=.*www.mydomain.ca.*
   url:site=/.*www.mydomain.ca.*/www/

   urlmatch=.*foo.mydomain.ca.*
   url:site=/.*foo.mydomain.ca.*/foo/

   urlmatch=.*bar.mydomain.ca.*
   url:site=/.*bar.mydomain.ca.*/bar/



This works as expected.  I am given the following site values for the
given url values:

url: https://www.mydomain.ca/test/path -> site: www
url: http://foo.mydomain.ca/some/other/path -> site: foo
url: https://bar.mydomain.ca/another/example -> site: bar

However, it means I have to have a definition for every host or
subdomain I am crawling (ie. www, foo, bar).  Can I use variable
substitution in index-replace or is there another way for me to do this
automatically?

regards,
Ryan