I have been trying to get the name of the field, but the error shown is generic and has no field name associated with it. I looked for the name in the Hadoop, Nutch, and Solr logs, but I didn't find any field name.

Thanks
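
Since the exception itself carries no field name, one way to narrow it down is to scan each field value of a document for Unicode noncharacters (such as U+FFFF) before it is sent to Solr. The following is only a minimal sketch under that assumption: the class NonCharFieldFinder and its methods are hypothetical names, not part of Nutch or SolrJ, and the field values are assumed to be available as a plain Map of field name to string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Nutch or SolrJ. */
final class NonCharFieldFinder {

  /** True for Unicode noncharacters: U+FDD0..U+FDEF and any code point ending in FFFE or FFFF. */
  static boolean isNonCharacter(int cp) {
    return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
  }

  /** Returns one "field@charIndex" entry for every noncharacter found in the field values. */
  static List<String> findOffenders(Map<String, String> fields) {
    List<String> hits = new ArrayList<>();
    for (Map.Entry<String, String> e : fields.entrySet()) {
      String value = e.getValue();
      if (value == null) {
        continue;
      }
      // Walk by code point so supplementary characters are not split.
      for (int i = 0; i < value.length(); ) {
        int cp = value.codePointAt(i);
        if (isNonCharacter(cp)) {
          hits.add(e.getKey() + "@" + i);
        }
        i += Character.charCount(cp);
      }
    }
    return hits;
  }
}
```

Logging the result of findOffenders(...) for each document just before it is added would show which field, if any, carries the offending character.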

On Monday 25 January 2016 06:10 PM, Markus Jelsma wrote:
That is odd! Is it on your content or title field?
Markus
-----Original message-----
From:Kshitij Shukla <[email protected]>
Sent: Monday 25th January 2016 11:41
To: [email protected]
Subject: [CIS-CMMI-3] Re: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception

Thanks for your response, Markus. I checked the code and found the
workaround you suggested in this file:

*Source:*
/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrUtils.java

and the method was called in this file:

*Invoked:*
/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
like this
          if (e.getKey().equals("content") || e.getKey().equals("title")) {
            val2 = SolrUtils.stripNonCharCodepoints(val);
          }

So the method is there and is apparently invoked at the right place.
Where do you think the problem could be?

Thanks again for your help.
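
The snippet quoted above only cleans the "content" and "title" fields. One possibility, which is an assumption and not something confirmed in this thread, is that the invalid character sits in a different field. A minimal sketch of stripping noncharacters from every string value could look like the following. NonCharStripper and its methods are hypothetical names for illustration, not the actual SolrUtils implementation, and the document is modelled as a plain Map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not the Nutch SolrUtils code. */
final class NonCharStripper {

  /** Returns the input without Unicode noncharacters (U+FDD0..U+FDEF, U+xxFFFE, U+xxFFFF). */
  static String stripNonCharacters(String input) {
    StringBuilder out = new StringBuilder(input.length());
    for (int i = 0; i < input.length(); ) {
      int cp = input.codePointAt(i);
      i += Character.charCount(cp);
      boolean nonChar = (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
      if (!nonChar) {
        out.appendCodePoint(cp);
      }
    }
    return out.toString();
  }

  /** Cleans every String value and leaves other value types untouched. */
  static Map<String, Object> cleanAll(Map<String, Object> fields) {
    Map<String, Object> cleaned = new LinkedHashMap<>();
    for (Map.Entry<String, Object> e : fields.entrySet()) {
      Object v = e.getValue();
      cleaned.put(e.getKey(), v instanceof String ? stripNonCharacters((String) v) : v);
    }
    return cleaned;
  }
}
```

Applying the cleanup to all string fields rather than a fixed list of names trades a little extra work per document for not having to know in advance which field is affected.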

On Monday 25 January 2016 03:35 PM, Markus Jelsma wrote:
Hi - this is NUTCH-1016, which was never ported to 2.x.

https://issues.apache.org/jira/browse/NUTCH-1016

-----Original message-----
From:Kshitij Shukla <[email protected]>
Sent: Monday 25th January 2016 8:23
To: [email protected]
Subject: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception

Hello everyone,

During a very large crawl, indexing to Solr yields the
following exception:

**************************************************
root@cism479:/usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin#
/usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin/nutch
index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -D
solr.server.url=http://localhost:8983/solr/ddcds -all -crawlId 1
16/01/25 11:44:52 INFO indexer.IndexingJob: IndexingJob: starting
16/01/25 11:44:53 INFO Configuration.deprecation:
mapred.output.key.comparator.class is deprecated. Instead, use
mapreduce.job.output.key.comparator.class
16/01/25 11:44:53 INFO plugin.PluginRepository: Plugins: looking in:
/tmp/hadoop-root/hadoop-unjar4772724649160367470/classes/plugins
16/01/25 11:44:54 INFO plugin.PluginRepository: Plugin Auto-activation
mode: [true]
16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Plugins:
16/01/25 11:44:54 INFO plugin.PluginRepository:     HTTP Framework
(lib-http)
16/01/25 11:44:54 INFO plugin.PluginRepository:     Html Parse Plug-in
(parse-html)
16/01/25 11:44:54 INFO plugin.PluginRepository:     MetaTags
(parse-metatags)
16/01/25 11:44:54 INFO plugin.PluginRepository:     Html Indexing Filter
(index-html)
16/01/25 11:44:54 INFO plugin.PluginRepository:     the nutch core
extension points (nutch-extensionpoints)
16/01/25 11:44:54 INFO plugin.PluginRepository:     Basic Indexing
Filter (index-basic)
16/01/25 11:44:54 INFO plugin.PluginRepository:     XML Libraries (lib-xml)
16/01/25 11:44:54 INFO plugin.PluginRepository:     Anchor Indexing
Filter (index-anchor)
16/01/25 11:44:54 INFO plugin.PluginRepository:     Basic URL Normalizer
(urlnormalizer-basic)
16/01/25 11:44:54 INFO plugin.PluginRepository:     Language
Identification Parser/Filter (language-identifier)
16/01/25 11:44:54 INFO plugin.PluginRepository:     Metadata Indexing
Filter (index-metadata)
16/01/25 11:44:54 INFO plugin.PluginRepository:     CyberNeko HTML
Parser (lib-nekohtml)
16/01/25 11:44:54 INFO plugin.PluginRepository:     Subcollection
indexing and query filter (subcollection)
16/01/25 11:44:54 INFO plugin.PluginRepository: SOLRIndexWriter
(indexer-solr)
16/01/25 11:44:54 INFO plugin.PluginRepository:     Rel-Tag microformat
Parser/Indexer/Querier (microformats-reltag)
16/01/25 11:44:54 INFO plugin.PluginRepository:     Http / Https
Protocol Plug-in (protocol-httpclient)
16/01/25 11:44:54 INFO plugin.PluginRepository:     JavaScript Parser
(parse-js)
16/01/25 11:44:54 INFO plugin.PluginRepository:     Tika Parser Plug-in
(parse-tika)
16/01/25 11:44:54 INFO plugin.PluginRepository:     Top Level Domain
Plugin (tld)
16/01/25 11:44:54 INFO plugin.PluginRepository:     Regex URL Filter
Framework (lib-regex-filter)
16/01/25 11:44:54 INFO plugin.PluginRepository:     Regex URL Normalizer
(urlnormalizer-regex)
16/01/25 11:44:54 INFO plugin.PluginRepository:     Link Analysis
Scoring Plug-in (scoring-link)
16/01/25 11:44:54 INFO plugin.PluginRepository:     OPIC Scoring Plug-in
(scoring-opic)
16/01/25 11:44:54 INFO plugin.PluginRepository:     More Indexing Filter
(index-more)
16/01/25 11:44:54 INFO plugin.PluginRepository:     Http Protocol
Plug-in (protocol-http)
16/01/25 11:44:54 INFO plugin.PluginRepository:     Creative Commons
Plugins (creativecommons)
16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Extension-Points:
16/01/25 11:44:54 INFO plugin.PluginRepository:     Parse Filter
(org.apache.nutch.parse.ParseFilter)
16/01/25 11:44:54 INFO plugin.PluginRepository:     Nutch Index Cleaning
Filter (org.apache.nutch.indexer.IndexCleaningFilter)
16/01/25 11:44:54 INFO plugin.PluginRepository:     Nutch Content Parser
(org.apache.nutch.parse.Parser)
16/01/25 11:44:54 INFO plugin.PluginRepository:     Nutch URL Filter
(org.apache.nutch.net.URLFilter)
16/01/25 11:44:54 INFO plugin.PluginRepository:     Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
16/01/25 11:44:54 INFO plugin.PluginRepository:     Nutch URL Normalizer
(org.apache.nutch.net.URLNormalizer)
16/01/25 11:44:54 INFO plugin.PluginRepository:     Nutch Protocol
(org.apache.nutch.protocol.Protocol)
16/01/25 11:44:54 INFO plugin.PluginRepository:     Nutch Index Writer
(org.apache.nutch.indexer.IndexWriter)
16/01/25 11:44:54 INFO plugin.PluginRepository:     Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding
org.apache.nutch.indexer.html.HtmlIndexingFilter
16/01/25 11:44:54 INFO basic.BasicIndexingFilter: Maximum title length
for indexing set to: 100
16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding
org.apache.nutch.indexer.basic.BasicIndexingFilter
16/01/25 11:44:54 INFO anchor.AnchorIndexingFilter: Anchor deduplication
is: off
16/01/25 11:45:07 INFO mapreduce.JobSubmitter: Submitting tokens for
job: job_1453472314066_0007
16/01/25 11:45:08 INFO impl.YarnClientImpl: Submitted application
application_1453472314066_0007
16/01/25 11:45:09 INFO mapreduce.Job: The url to track the job:
http://cism479:8088/proxy/application_1453472314066_0007/
16/01/25 11:45:09 INFO mapreduce.Job: Running job: job_1453472314066_0007
16/01/25 11:45:29 INFO mapreduce.Job: Job job_1453472314066_0007 running
in uber mode : false
16/01/25 11:45:29 INFO mapreduce.Job:  map 0% reduce 0%
16/01/25 11:49:24 INFO mapreduce.Job:  map 50% reduce 0%
16/01/25 11:49:29 INFO mapreduce.Job:  map 0% reduce 0%
16/01/25 11:49:29 INFO mapreduce.Job: Task Id :
attempt_1453472314066_0007_m_000000_0, Status : FAILED
Error:
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
[com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
char #1296459, byte #1310719)
       at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
       at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
       at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
       at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
       at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
       at
org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
       at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
       at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
       at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
       at
org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
       at
org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
       at
org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
       at
org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
       at
org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
       at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
       at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
       at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
       at java.security.AccessController.doPrivileged(Native Method)
       at javax.security.auth.Subject.doAs(Subject.java:422)
       at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
       at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

16/01/25 11:52:27 INFO mapreduce.Job:  map 50% reduce 0%
16/01/25 11:53:01 INFO mapreduce.Job:  map 100% reduce 0%
16/01/25 11:53:01 INFO mapreduce.Job: Task Id :
attempt_1453472314066_0007_m_000000_1, Status : FAILED
Error:
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
[com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
char #1296459, byte #1310719)
       at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
       at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
       at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
       at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
       at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
       at
org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
       at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
       at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
       at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
       at
org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
       at
org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
       at
org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
       at
org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
       at
org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
       at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
       at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
       at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
       at java.security.AccessController.doPrivileged(Native Method)
       at javax.security.auth.Subject.doAs(Subject.java:422)
       at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
       at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

16/01/25 11:53:02 INFO mapreduce.Job:  map 50% reduce 0%
16/01/25 11:54:52 INFO mapreduce.Job:  map 100% reduce 0%
16/01/25 11:54:52 INFO mapreduce.Job: Task Id :
attempt_1453472314066_0007_m_000000_2, Status : FAILED
Error:
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
[com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
char #1296459, byte #1310719)
       at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
       at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
       at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
       at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
       at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
       at
org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
       at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
       at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
       at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
       at
org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
       at
org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
       at
org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
       at
org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
       at
org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
       at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
       at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
       at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
       at java.security.AccessController.doPrivileged(Native Method)
       at javax.security.auth.Subject.doAs(Subject.java:422)
       at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
       at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

16/01/25 11:54:53 INFO mapreduce.Job:  map 50% reduce 0%
16/01/25 11:56:22 INFO mapreduce.Job:  map 100% reduce 0%
16/01/25 11:56:23 INFO mapreduce.Job: Job job_1453472314066_0007 failed
with state FAILED due to: Task failed task_1453472314066_0007_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0

16/01/25 11:56:23 INFO mapreduce.Job: Counters: 33
       File System Counters
           FILE: Number of bytes read=0
           FILE: Number of bytes written=116194
           FILE: Number of read operations=0
           FILE: Number of large read operations=0
           FILE: Number of write operations=0
           HDFS: Number of bytes read=1033
           HDFS: Number of bytes written=0
           HDFS: Number of read operations=1
           HDFS: Number of large read operations=0
           HDFS: Number of write operations=0
       Job Counters
           Failed map tasks=4
           Launched map tasks=5
           Other local map tasks=3
           Data-local map tasks=2
           Total time spent by all maps in occupied slots (ms)=3168342
           Total time spent by all reduces in occupied slots (ms)=0
           Total time spent by all map tasks (ms)=1056114
           Total vcore-seconds taken by all map tasks=1056114
           Total megabyte-seconds taken by all map tasks=3244382208
       Map-Reduce Framework
           Map input records=2762511
           Map output records=17629
           Input split bytes=1033
           Spilled Records=0
           Failed Shuffles=0
           Merged Map outputs=0
           GC time elapsed (ms)=2995
           CPU time spent (ms)=116860
           Physical memory (bytes) snapshot=1272868864
           Virtual memory (bytes) snapshot=5104431104
           Total committed heap usage (bytes)=1017118720
       IndexerJob
           DocumentCount=17629
       File Input Format Counters
           Bytes Read=0
       File Output Format Counters
           Bytes Written=0
16/01/25 11:56:23 ERROR indexer.IndexingJob: SolrIndexerJob:
java.lang.RuntimeException: job failed: name=[1]Indexer,
jobid=job_1453472314066_0007
       at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
       at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
       at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
       at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
       at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
       at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:497)
       at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
*******************************************************
--

Please let me know if you have any questions , concerns or updates.
Have a great day ahead :)

Thanks and Regards,

Kshitij Shukla
Software developer

*Cyber Infrastructure(CIS)
**/The RightSourcing Specialists with 1250 man years of experience!/*

DISCLAIMER:  INFORMATION PRIVACY is important for us, If you are not the
intended recipient, you should delete this message and are notified that
any disclosure, copying or distribution of this message, or taking any
action based on it, is strictly prohibited by Law.

Please don't print this e-mail unless you really need to.

--

------------------------------

*Cyber Infrastructure (P) Limited, [CIS] **(CMMI Level 3 Certified)*

Central India's largest Technology company.

*Ensuring the success of our clients and partners through our highly
optimized Technology solutions.*

www.cisin.com | +Cisin <https://plus.google.com/+Cisin/> | Linkedin
<https://www.linkedin.com/company/cyber-infrastructure-private-limited> |
Offices: *Indore, India.* *Singapore. Silicon Valley, USA*.



