RE: invalid utf8 chars when indexing or cleaning
Set logging to debug; HttpClient then logs what is being sent over the wire, so you can catch the data. It is less tedious than Wireshark.

-----Original message-----
> From: Michael Coffey <mcof...@yahoo.com.INVALID>
> Sent: Friday, September 1, 2017 5:12
> To: user@nutch.apache.org
> Subject: Re: invalid utf8 chars when indexing or cleaning
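As a starting point for the wire-logging suggestion above: HttpClient wire output is controlled through Nutch's log4j configuration. The exact logger names depend on which HttpClient library your Nutch build bundles, so treat the following `conf/log4j.properties` fragment as a sketch to adapt, not a verified recipe:

```properties
# Sketch only - logger names vary with the HttpClient version in use.
# HttpComponents (4.x) loggers:
log4j.logger.org.apache.http=DEBUG
log4j.logger.org.apache.http.wire=DEBUG
# Older Commons HttpClient (3.x) loggers:
log4j.logger.httpclient.wire=DEBUG
log4j.logger.org.apache.commons.httpclient=DEBUG
```

With wire logging at DEBUG, the full request body sent to Solr appears in the task logs, so the bytes around the reported offset can be inspected directly.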
Re: invalid utf8 chars when indexing or cleaning
It sounds like a good suggestion, but I don't know what you mean by "verify the output Nutch generates and inspect it manually." How do I get a look at that XML?

> From:
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Sent: Thursday, August 31, 2017 11:59 AM
> Subject: RE: invalid utf8 chars when indexing or cleaning
RE: invalid utf8 chars when indexing or cleaning
The bug is identical, but I fixed it! You should verify the output Nutch generates and inspect it manually; there should be a 0x at that byte. If it really is there, we need to check the fix once more, though I am sure the patch works as intended.

Get the XML, pass it through the method and see what it does to the output.

-----Original message-----
> From: Jorge Betancourt <betancourt.jo...@gmail.com>
> Sent: Tuesday, August 29, 2017 21:54
> To: user@nutch.apache.org
> Subject: Re: invalid utf8 chars when indexing or cleaning
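For anyone who wants to try "pass the XML through the method" without a Nutch build at hand: the NUTCH-1016 fix filters out characters that are not legal in XML 1.0 before the document is serialized for Solr. The following is a standalone Python sketch of that kind of filter, not the actual Java code in the Nutch Solr index writer:

```python
def strip_invalid_xml_chars(text: str) -> str:
    """Remove code points that XML 1.0 forbids.

    A sketch mirroring the idea behind the NUTCH-1016 fix; the real
    implementation is Java code inside Nutch's Solr index writer.
    """
    def xml_ok(cp: int) -> bool:
        # Legal XML 1.0 characters: tab, LF, CR, and these ranges.
        return (
            cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF
        )

    return "".join(ch for ch in text if xml_ok(ord(ch)))
```

Running a captured field value through this and comparing input to output shows whether an illegal character (such as a NUL byte) was present at the reported offset.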
Re: invalid utf8 chars when indexing or cleaning
From the logs it looks like the error is coming from the Solr side. Do you mind checking/sharing the logs on your Solr server? Can you pinpoint which URL is causing the issue?

Best Regards,
Jorge

> On Tue, Aug 29, 2017 at 9:25 PM, Michael Coffey <mcof...@yahoo.com.invalid> wrote:
Re: invalid utf8 chars when indexing or cleaning
Does anybody have any thoughts on this? It seems similar to the NUTCH-1016 bug that was fixed in version 1.4.

Some more bits of information: the indexer job rarely fails (only 1 of the last 99 segments) but the cleaning job fails every time now. Once again, this is Nutch 1.12 and Solr 5.4.1. I recently upgraded to Hadoop 2.7.4 and Java 1.8 from Hadoop 2.7.2 and Java 1.7. Could this be some kind of mismatch of versions?

> To: User <user@nutch.apache.org>
> Sent: Thursday, August 24, 2017 7:42 PM
> Subject: invalid utf8 chars when indexing or cleaning
>
> Lately, I have seen many tasks and jobs fail in Solr when doing nutch index and nutch clean.
>
> Messages during indexing look like this.
>
> 17/08/24 19:18:59 INFO mapreduce.Job: map 100% reduce 99%
> 17/08/24 19:19:36 INFO mapreduce.Job: Task Id : attempt_1502929850483_1329_r_07_2, Status : FAILED
> Error: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://codero4.neocortix.com:8984/solr/popular: [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0x at char #104705, byte #219135)
>     at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:575)
>     at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:241)
>     at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:230)
>     at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1220)
>     at org.apache.nutch.indexwriter.solr.SolrIndexWriter.push(SolrIndexWriter.java:209)
>     at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:173)
>     at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:85)
>     at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
>     at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)
>     at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:493)
>
> Messages during cleaning look like this.
>
> 17/08/22 09:24:01 INFO mapreduce.Job: map 100% reduce 92%
> 17/08/22 09:25:57 INFO mapreduce.Job: Task Id : attempt_1502929850483_1016_r_03_1, Status : FAILED
> Error: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://codero4.neocortix.com:8984/solr/popular: [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0x at char #16099, byte #16383)
>     at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:575)
>     at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:241)
>     at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:230)
>     at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:150)
>     at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:825)
>     at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:788)
>     at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:803)
>     at org.apache.nutch.indexwriter.solr.SolrIndexWriter.push(SolrIndexWriter.java:222)
>     at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:187)
>     at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:178)
>     at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
>     at org.apache.nutch.indexer.CleaningJob$DeleterReducer.close(CleaningJob.java:120)
>     at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:245)
>
> Can anyone suggest a way to fix this? I am using Nutch 1.12 and Solr 5.4.1. I recently upgraded to Hadoop 2.7.4 and Java 1.8. I don't remember noticing this happening with Hadoop 2.7.2 and Java 1.7. It happens very often now.
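The errors above report exact byte and character offsets. Once the request body has been captured (for example via wire logging), a short script can locate the first problem spot: either a byte sequence that is not valid UTF-8, or a character (such as NUL, which decodes fine as UTF-8 but XML 1.0 forbids) that the Solr-side XML parser will reject. A sketch, with a hypothetical capture file name in the usage comment:

```python
def first_problem(data: bytes):
    """Return ("utf8", byte_offset) for the first invalid UTF-8
    sequence, ("xml", char_index) for the first character XML 1.0
    forbids, or None if the data is clean."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as e:
        # e.start is the byte offset where decoding failed.
        return ("utf8", e.start)
    for i, ch in enumerate(text):
        cp = ord(ch)
        # Legal XML 1.0 characters: tab, LF, CR, and these ranges.
        if not (cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF
                or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF):
            return ("xml", i)
    return None

# Usage against a captured Solr update body (hypothetical file name):
# with open("solr_update.xml", "rb") as f:
#     print(first_problem(f.read()))
```

Note the second offset is a character index, not a byte offset, matching the "at char #..." part of the Woodstox error message.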