Re: Nutch failing on SOLR text field

2019-03-26 Thread Dave Beckstrom
Hi Jorge,

I'm running Solr 7.3.1 which is compatible with the version of Nutch I'm
running.


Field is defined as:



I think this is the relevant part from the stack trace:

    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:626)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.NumberFormatException: For input string: "myfieldname:docid:33011-54192-XXHServer-3BA9D1CA-05B6-42BA-9D88-BAD970CAEEC6"
    at java.lang.NumberFormatException.forInputString(Unknown Source)
    at java.lang.Long.parseLong(Unknown Source)
    at java.lang.Long.parseLong(Unknown Source)
    at org.apache.solr.schema.LongPointField.createField(LongPointField.java:154)
    at org.apache.solr.schema.PointField.createFields(PointField.java:250)
    at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:66)
    at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:159)

It seems to be treating it as a number?
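
The trace shows Long.parseLong being reached from LongPointField.createField, so the same failure can probably be reproduced outside Solr. Below is a minimal sketch of that, not Solr code: the class and method names (LongFieldParseDemo, parsesAsLong) are made up for illustration, and it assumes only that the value is handed to Long.parseLong as the trace suggests.

```java
public class LongFieldParseDemo {

    // Hypothetical helper: returns true if the value survives Long.parseLong,
    // false if it raises NumberFormatException (the exception seen in the trace).
    static boolean parsesAsLong(String value) {
        try {
            Long.parseLong(value);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The input string exactly as it appears in the stack trace.
        String fromTrace =
            "myfieldname:docid:33011-54192-XXHServer-3BA9D1CA-05B6-42BA-9D88-BAD970CAEEC6";

        // A purely numeric value parses; the value from the trace does not.
        System.out.println("33011 parses as long: " + parsesAsLong("33011"));
        System.out.println("trace value parses as long: " + parsesAsLong(fromTrace));
    }
}
```

If that assumption holds, the hyphens and the length are not the problem in themselves; the value is simply not a number, which would explain why only numeric values indexed cleanly.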



Best,

Dave

On Tue, Mar 26, 2019 at 5:06 PM Jorge Betancourt wrote:

> Hi Dave,
>
> Can you check the Solr logs and post the relevant exception? It would also
> be helpful if you attached the definition of the text field in your Solr
> collection.
>
> Best regards,
> Jorge
>
> On Tue, Mar 26, 2019 at 9:41 PM Dave Beckstrom wrote:
>
> > Hi Everyone,
> >
> > This is probably more of a SOLR question but I'm hoping someone might be
> > able to help.  I'm using Nutch to crawl and index some content.  It failed
> > on a SOLR field defined as a text field when it was trying to insert the
> > following value for the field:
> >
> > 33011-54192-EWHServer1234-3BA9D1CA-05B6-42BA-9D88-BAD970CAEEC6
> >
> > The field was defined in the schema.xml as:
> >
> >  > indexed="true"/>
> >
> > The error message said it was a RemoteSolrException from the server and
> > that it was an error adding the field.  I'm pretty certain the issue was
> > the value being inserted as it worked fine for 100's of pages and then
> > failed on the one page that had data formatted differently than on other
> > pages.
> >
> > From what I was able to find searching, it doesn't look like the length of
> > the data would be any issue at all for a text field.  I am wondering if the
> > problem is the dashes (hyphens) in the data?
> >
> > Any suggestions on how to fix this?  I can delete the collection and
> > redefine it with a field other than text, if that is the answer.
> >
> > Thank you!
> >
> > Dave
> >
> > --
> > *Fig Leaf Software, Inc.*
> > https://www.figleaf.com/
> > 
> >
> > Full-Service Solutions Integrator
>



Re: Nutch failing on SOLR text field

2019-03-26 Thread Jorge Betancourt
Hi Dave,

Can you check the Solr logs and post the relevant exception? It would also
be helpful if you attached the definition of the text field in your Solr
collection.

Best regards,
Jorge

On Tue, Mar 26, 2019 at 9:41 PM Dave Beckstrom wrote:

> Hi Everyone,
>
> This is probably more of a SOLR question but I'm hoping someone might be
> able to help.  I'm  using Nutch to crawl and index some content.  It failed
> on a SOLR field defined as a text field when it was trying to insert the
> following value for the field:
>
> 33011-54192-EWHServer1234-3BA9D1CA-05B6-42BA-9D88-BAD970CAEEC6
>
> The field was defined in the schema.xml as:
>
>  indexed="true"/>
>
> The error message said it was a RemoteSolrException from the server and
> that it was an error adding the field.  I'm pretty certain the issue was
> the value being inserted as it worked fine for 100's of pages and then
> failed on the one page that had data formatted differently than on other
> pages.
>
> From what I was able to find searching, it doesn't look like the length of
> the data would be any issue at all for a text field.  I am wondering if the
> problem is the dashes (hyphens) in the data?
>
> Any suggestions on how to fix this?  I can delete the collection and
> redefine it with a field other than text, if that is the answer.
>
> Thank you!
>
> Dave
>


Nutch failing on SOLR text field

2019-03-26 Thread Dave Beckstrom
Hi Everyone,

This is probably more of a SOLR question but I'm hoping someone might be
able to help.  I'm  using Nutch to crawl and index some content.  It failed
on a SOLR field defined as a text field when it was trying to insert the
following value for the field:

33011-54192-EWHServer1234-3BA9D1CA-05B6-42BA-9D88-BAD970CAEEC6

The field was defined in the schema.xml as:



The error message said it was a RemoteSolrException from the server and
that it was an error adding the field.  I'm pretty certain the issue was
the value being inserted as it worked fine for 100's of pages and then
failed on the one page that had data formatted differently than on other
pages.

From what I was able to find searching, it doesn't look like the length of
the data would be any issue at all for a text field.  I am wondering if the
problem is the dashes (hyphens) in the data?

Any suggestions on how to fix this?  I can delete the collection and
redefine it with a field other than text, if that is the answer.

Thank you!

Dave



RE: Meta tags are duplicated

2019-03-26 Thread Sadiki Latty
Hey,

This is caused by usage of the Tika plugin and MetatagParser. I am currently 
using this patch to resolve the issue

https://issues.apache.org/jira/browse/NUTCH-1559

Cheers,

Sadiki Latty
Web Developer/ Développeur Web
Technologies de l’information / Information Technology
Université d'Ottawa | University of Ottawa 
1 Nicholas (801)
613-562-5800 ext. 7512


-----Original Message-----
From: hany.n...@hsbc.com.INVALID [mailto:hany.n...@hsbc.com.INVALID] 
Sent: March 26, 2019 4:53 AM
To: user@nutch.apache.org
Subject: Meta tags are duplicated

Hello

I'm using Nutch 1.15 and parsing/indexing meta tags using parse-metatags plugin.

Values always come out duplicated, which forced me to change the Solr fields to 
multivalued.

Example:  

Moreover, I ran indexchecker and can see the duplication as well.

Any advice on how to remove this duplication?

Kind regards,
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT Corporate Functions | HSBC 
Operations, Services and Technology (HOST) ul. Kapelanka 42A, 30-347 Kraków, 
Poland __

Tie line: 7148 7689 4698
External: +48 123 42 0698
Mobile: +48 723 680 278
E-mail: hany.n...@hsbc.com
__
Protect our environment - please only print this if you have to!



-
SAVE PAPER - THINK BEFORE YOU PRINT!

This E-mail is confidential.  

It may also be legally privileged. If you are not the addressee you may not 
copy, forward, disclose or use any part of it. If you have received this 
message in error, please delete it and all copies from your system and notify 
the sender immediately by return E-mail.

Internet communications cannot be guaranteed to be timely secure, error or 
virus-free.
The sender does not accept liability for any errors or omissions.


Meta tags are duplicated

2019-03-26 Thread hany . nasr
Hello

I'm using Nutch 1.15 and parsing/indexing meta tags using parse-metatags plugin.

Values always come out duplicated, which forced me to change the Solr fields to 
multivalued.

Example:  

Moreover, I ran indexchecker and can see the duplication as well.

Any advice on how to remove this duplication?

Kind regards,
Hany Shehata
Enterprise Engineer
Green Six Sigma Certified
Solutions Architect, Marketing and Communications IT
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
__

Tie line: 7148 7689 4698
External: +48 123 42 0698
Mobile: +48 723 680 278
E-mail: hany.n...@hsbc.com
__
Protect our environment - please only print this if you have to!


