That is odd! Is it on your content or title field?
Markus
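
One way to answer that question is to scan every field of the document before it is sent to Solr. The sketch below is standalone Java, not Nutch code: the class NonCharCheck and its methods are hypothetical names, and it assumes the field values are available as plain strings. It flags and removes Unicode noncharacters such as U+FFFF, which is the code point reported in the exception quoted below.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Nutch or SolrJ. */
final class NonCharCheck {

  /** True for Unicode noncharacters: U+FDD0..U+FDEF and any code point ending in FFFE or FFFF. */
  static boolean isNonCharacter(int cp) {
    return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
  }

  /** Returns a copy of the input with every noncharacter code point removed. */
  static String strip(String input) {
    StringBuilder out = new StringBuilder(input.length());
    for (int i = 0; i < input.length(); ) {
      int cp = input.codePointAt(i);
      i += Character.charCount(cp); // advance by 1 or 2 chars (surrogate pairs)
      if (!isNonCharacter(cp)) {
        out.appendCodePoint(cp);
      }
    }
    return out.toString();
  }

  /** Names of the fields whose value contains at least one noncharacter. */
  static List<String> fieldsWithNonCharacters(Map<String, String> fields) {
    List<String> hits = new ArrayList<>();
    for (Map.Entry<String, String> e : fields.entrySet()) {
      String value = e.getValue();
      if (value != null && value.codePoints().anyMatch(NonCharCheck::isNonCharacter)) {
        hits.add(e.getKey());
      }
    }
    return hits;
  }
}
```

Logging the result of fieldsWithNonCharacters for a failing document would show whether the offending character sits in content, title, or some other field that the existing check never strips.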
-----Original message-----
> From:Kshitij Shukla <[email protected]>
> Sent: Monday 25th January 2016 11:41
> To: [email protected]
> Subject: [CIS-CMMI-3] Re: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at char
> exception
>
> Thanks for your response, Markus. I checked the code and found the
> workaround you suggested in this file:
>
> *Source:*
> /src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrUtils.java
>
> and the method was called in this file:
>
> *Invoked:*
> /src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
> like this
> if (e.getKey().equals("content") || e.getKey().equals("title")) {
>   val2 = SolrUtils.stripNonCharCodepoints(val);
> }
>
> So the method is there and is apparently invoked at the right place. Where
> do you think the problem could be?
>
> Thanks again for your help.
>
> On Monday 25 January 2016 03:35 PM, Markus Jelsma wrote:
> > Hi - this is NUTCH-1016, which was never ported to 2.x.
> >
> > https://issues.apache.org/jira/browse/NUTCH-1016
> >
> >
> >
> > -----Original message-----
> >> From:Kshitij Shukla <[email protected]>
> >> Sent: Monday 25th January 2016 8:23
> >> To: [email protected]
> >> Subject: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception
> >>
> >> Hello everyone,
> >>
> >> During a very large crawl, indexing to Solr yields the
> >> following exception:
> >>
> >> **************************************************
> >> root@cism479:/usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin#
> >> /usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin/nutch
> >> index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
> >> mapred.reduce.tasks.speculative.execution=false -D
> >> mapred.map.tasks.speculative.execution=false -D
> >> mapred.compress.map.output=true -D
> >> solr.server.url=http://localhost:8983/solr/ddcds -all -crawlId 1
> >> 16/01/25 11:44:52 INFO indexer.IndexingJob: IndexingJob: starting
> >> 16/01/25 11:44:53 INFO Configuration.deprecation:
> >> mapred.output.key.comparator.class is deprecated. Instead, use
> >> mapreduce.job.output.key.comparator.class
> >> 16/01/25 11:44:53 INFO plugin.PluginRepository: Plugins: looking in:
> >> /tmp/hadoop-root/hadoop-unjar4772724649160367470/classes/plugins
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Plugin Auto-activation
> >> mode: [true]
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Plugins:
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: HTTP Framework
> >> (lib-http)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Html Parse Plug-in
> >> (parse-html)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: MetaTags
> >> (parse-metatags)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Html Indexing Filter
> >> (index-html)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: the nutch core
> >> extension points (nutch-extensionpoints)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Basic Indexing
> >> Filter (index-basic)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: XML Libraries (lib-xml)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Anchor Indexing
> >> Filter (index-anchor)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Basic URL Normalizer
> >> (urlnormalizer-basic)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Language
> >> Identification Parser/Filter (language-identifier)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Metadata Indexing
> >> Filter (index-metadata)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: CyberNeko HTML
> >> Parser (lib-nekohtml)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Subcollection
> >> indexing and query filter (subcollection)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: SOLRIndexWriter
> >> (indexer-solr)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Rel-Tag microformat
> >> Parser/Indexer/Querier (microformats-reltag)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Http / Https
> >> Protocol Plug-in (protocol-httpclient)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: JavaScript Parser
> >> (parse-js)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Tika Parser Plug-in
> >> (parse-tika)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Top Level Domain
> >> Plugin (tld)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Regex URL Filter
> >> Framework (lib-regex-filter)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Regex URL Normalizer
> >> (urlnormalizer-regex)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Link Analysis
> >> Scoring Plug-in (scoring-link)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: OPIC Scoring Plug-in
> >> (scoring-opic)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: More Indexing Filter
> >> (index-more)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Http Protocol
> >> Plug-in (protocol-http)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Creative Commons
> >> Plugins (creativecommons)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Registered
> >> Extension-Points:
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Parse Filter
> >> (org.apache.nutch.parse.ParseFilter)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Index Cleaning
> >> Filter (org.apache.nutch.indexer.IndexCleaningFilter)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Content Parser
> >> (org.apache.nutch.parse.Parser)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch URL Filter
> >> (org.apache.nutch.net.URLFilter)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Scoring
> >> (org.apache.nutch.scoring.ScoringFilter)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch URL Normalizer
> >> (org.apache.nutch.net.URLNormalizer)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Protocol
> >> (org.apache.nutch.protocol.Protocol)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Index Writer
> >> (org.apache.nutch.indexer.IndexWriter)
> >> 16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Indexing
> >> Filter (org.apache.nutch.indexer.IndexingFilter)
> >> 16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding
> >> org.apache.nutch.indexer.html.HtmlIndexingFilter
> >> 16/01/25 11:44:54 INFO basic.BasicIndexingFilter: Maximum title length
> >> for indexing set to: 100
> >> 16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding
> >> org.apache.nutch.indexer.basic.BasicIndexingFilter
> >> 16/01/25 11:44:54 INFO anchor.AnchorIndexingFilter: Anchor deduplication
> >> is: off
> >> 16/01/25 11:45:07 INFO mapreduce.JobSubmitter: Submitting tokens for
> >> job: job_1453472314066_0007
> >> 16/01/25 11:45:08 INFO impl.YarnClientImpl: Submitted application
> >> application_1453472314066_0007
> >> 16/01/25 11:45:09 INFO mapreduce.Job: The url to track the job:
> >> http://cism479:8088/proxy/application_1453472314066_0007/
> >> 16/01/25 11:45:09 INFO mapreduce.Job: Running job: job_1453472314066_0007
> >> 16/01/25 11:45:29 INFO mapreduce.Job: Job job_1453472314066_0007 running
> >> in uber mode : false
> >> 16/01/25 11:45:29 INFO mapreduce.Job: map 0% reduce 0%
> >> 16/01/25 11:49:24 INFO mapreduce.Job: map 50% reduce 0%
> >> 16/01/25 11:49:29 INFO mapreduce.Job: map 0% reduce 0%
> >> 16/01/25 11:49:29 INFO mapreduce.Job: Task Id :
> >> attempt_1453472314066_0007_m_000000_0, Status : FAILED
> >> Error:
> >> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> >> [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
> >> char #1296459, byte #1310719)
> >> at
> >> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
> >> at
> >> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
> >> at
> >> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
> >> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
> >> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
> >> at
> >> org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
> >> at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
> >> at
> >> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
> >> at
> >> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
> >> at
> >> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
> >> at
> >> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
> >> at
> >> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
> >> at
> >> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
> >> at
> >> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
> >> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
> >> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> >> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
> >> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
> >> at java.security.AccessController.doPrivileged(Native Method)
> >> at javax.security.auth.Subject.doAs(Subject.java:422)
> >> at
> >> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> >> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
> >>
> >> 16/01/25 11:52:27 INFO mapreduce.Job: map 50% reduce 0%
> >> 16/01/25 11:53:01 INFO mapreduce.Job: map 100% reduce 0%
> >> 16/01/25 11:53:01 INFO mapreduce.Job: Task Id :
> >> attempt_1453472314066_0007_m_000000_1, Status : FAILED
> >> Error:
> >> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> >> [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
> >> char #1296459, byte #1310719)
> >> at
> >> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
> >> at
> >> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
> >> at
> >> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
> >> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
> >> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
> >> at
> >> org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
> >> at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
> >> at
> >> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
> >> at
> >> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
> >> at
> >> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
> >> at
> >> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
> >> at
> >> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
> >> at
> >> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
> >> at
> >> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
> >> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
> >> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> >> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
> >> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
> >> at java.security.AccessController.doPrivileged(Native Method)
> >> at javax.security.auth.Subject.doAs(Subject.java:422)
> >> at
> >> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> >> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
> >>
> >> 16/01/25 11:53:02 INFO mapreduce.Job: map 50% reduce 0%
> >> 16/01/25 11:54:52 INFO mapreduce.Job: map 100% reduce 0%
> >> 16/01/25 11:54:52 INFO mapreduce.Job: Task Id :
> >> attempt_1453472314066_0007_m_000000_2, Status : FAILED
> >> Error:
> >> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> >> [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
> >> char #1296459, byte #1310719)
> >> at
> >> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
> >> at
> >> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
> >> at
> >> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
> >> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
> >> at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
> >> at
> >> org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
> >> at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
> >> at
> >> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
> >> at
> >> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
> >> at
> >> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
> >> at
> >> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
> >> at
> >> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
> >> at
> >> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
> >> at
> >> org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
> >> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
> >> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> >> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
> >> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
> >> at java.security.AccessController.doPrivileged(Native Method)
> >> at javax.security.auth.Subject.doAs(Subject.java:422)
> >> at
> >> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
> >> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
> >>
> >> 16/01/25 11:54:53 INFO mapreduce.Job: map 50% reduce 0%
> >> 16/01/25 11:56:22 INFO mapreduce.Job: map 100% reduce 0%
> >> 16/01/25 11:56:23 INFO mapreduce.Job: Job job_1453472314066_0007 failed
> >> with state FAILED due to: Task failed task_1453472314066_0007_m_000000
> >> Job failed as tasks failed. failedMaps:1 failedReduces:0
> >>
> >> 16/01/25 11:56:23 INFO mapreduce.Job: Counters: 33
> >> File System Counters
> >> FILE: Number of bytes read=0
> >> FILE: Number of bytes written=116194
> >> FILE: Number of read operations=0
> >> FILE: Number of large read operations=0
> >> FILE: Number of write operations=0
> >> HDFS: Number of bytes read=1033
> >> HDFS: Number of bytes written=0
> >> HDFS: Number of read operations=1
> >> HDFS: Number of large read operations=0
> >> HDFS: Number of write operations=0
> >> Job Counters
> >> Failed map tasks=4
> >> Launched map tasks=5
> >> Other local map tasks=3
> >> Data-local map tasks=2
> >> Total time spent by all maps in occupied slots (ms)=3168342
> >> Total time spent by all reduces in occupied slots (ms)=0
> >> Total time spent by all map tasks (ms)=1056114
> >> Total vcore-seconds taken by all map tasks=1056114
> >> Total megabyte-seconds taken by all map tasks=3244382208
> >> Map-Reduce Framework
> >> Map input records=2762511
> >> Map output records=17629
> >> Input split bytes=1033
> >> Spilled Records=0
> >> Failed Shuffles=0
> >> Merged Map outputs=0
> >> GC time elapsed (ms)=2995
> >> CPU time spent (ms)=116860
> >> Physical memory (bytes) snapshot=1272868864
> >> Virtual memory (bytes) snapshot=5104431104
> >> Total committed heap usage (bytes)=1017118720
> >> IndexerJob
> >> DocumentCount=17629
> >> File Input Format Counters
> >> Bytes Read=0
> >> File Output Format Counters
> >> Bytes Written=0
> >> 16/01/25 11:56:23 ERROR indexer.IndexingJob: SolrIndexerJob:
> >> java.lang.RuntimeException: job failed: name=[1]Indexer,
> >> jobid=job_1453472314066_0007
> >> at
> >> org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
> >> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
> >> at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
> >> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
> >> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> >> at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
> >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >> at
> >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> >> at
> >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >> at java.lang.reflect.Method.invoke(Method.java:497)
> >> at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
> >> *******************************************************
> >> --
> >>
> >> Please let me know if you have any questions, concerns or updates.
> >> Have a great day ahead :)
> >>
> >> Thanks and Regards,
> >>
> >> Kshitij Shukla
> >> Software developer
> >>
> >> *Cyber Infrastructure(CIS)
> >> **/The RightSourcing Specialists with 1250 man years of experience!/*
> >>
> >> DISCLAIMER: INFORMATION PRIVACY is important for us, If you are not the
> >> intended recipient, you should delete this message and are notified that
> >> any disclosure, copying or distribution of this message, or taking any
> >> action based on it, is strictly prohibited by Law.
> >>
> >> Please don't print this e-mail unless you really need to.
> >>
> >> --
> >>
> >> ------------------------------
> >>
> >> *Cyber Infrastructure (P) Limited, [CIS] **(CMMI Level 3 Certified)*
> >>
> >> Central India's largest Technology company.
> >>
> >> *Ensuring the success of our clients and partners through our highly
> >> optimized Technology solutions.*
> >>
> >> www.cisin.com | +Cisin <https://plus.google.com/+Cisin/> | Linkedin
> >> <https://www.linkedin.com/company/cyber-infrastructure-private-limited> |
> >> Offices: *Indore, India.* *Singapore. Silicon Valley, USA*.
> >>
>
>