I have been trying to find the name of the field, but the error shown is
generic and has no field name associated with it. I looked for the name
in the Hadoop, Nutch and Solr logs, but I did not find any field name.
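One way to narrow this down might be to scan a document's fields before
they are sent and report which one carries a Unicode noncharacter such as
U+FFFF. The following is only a minimal, standalone sketch: the class
NonCharFinder, its methods and the plain Map of field names to values are
hypothetical stand-ins for illustration, not part of Nutch or SolrJ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Nutch or SolrJ): finds fields whose
// string value contains a Unicode noncharacter such as U+FFFF.
final class NonCharFinder {

    // Noncharacters are U+FDD0..U+FDEF plus the last two code points
    // (U+xxFFFE and U+xxFFFF) of every plane.
    static boolean isNonCharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Returns the char index of the first noncharacter, or -1 if none.
    static int firstNonCharIndex(String value) {
        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            if (isNonCharacter(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    // Maps each offending field name to the index of its first noncharacter.
    // Non-string values are skipped.
    static Map<String, Integer> findOffendingFields(Map<String, Object> fields) {
        Map<String, Integer> hits = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : fields.entrySet()) {
            if (e.getValue() instanceof String) {
                int idx = firstNonCharIndex((String) e.getValue());
                if (idx >= 0) {
                    hits.put(e.getKey(), idx);
                }
            }
        }
        return hits;
    }
}
```

Logging the returned field names just before the document is added would
show whether the character sits in a field other than content or title.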
Thanks
On Monday 25 January 2016 06:10 PM, Markus Jelsma wrote:
That is odd! Is it on your content or title field?
Markus
-----Original message-----
From:Kshitij Shukla <[email protected]>
Sent: Monday 25th January 2016 11:41
To: [email protected]
Subject: [CIS-CMMI-3] Re: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception
Thanks for your response, Markus. I checked the code and found the
workaround you suggested in this file:
*Source:*
/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrUtils.java
and the method was called in this file:
*Invoked:*
/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
like this
    if (e.getKey().equals("content") || e.getKey().equals("title")) {
      val2 = SolrUtils.stripNonCharCodepoints(val);
    }
So the method is there and is apparently invoked in the right place.
Where do you think the problem could be?
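If the character turns out to be in some other field, a possible approach
is to apply the same kind of stripping to every string value instead of
only content and title. This is a minimal sketch, assuming the fields are
available as a plain Map; NonCharStripper and its methods are hypothetical
names for illustration and are not the SolrUtils implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not the Nutch SolrUtils code): removes Unicode
// noncharacters from every string field of a document.
final class NonCharStripper {

    // Noncharacters are U+FDD0..U+FDEF plus the last two code points
    // (U+xxFFFE and U+xxFFFF) of every plane.
    static boolean isNonCharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Returns a copy of the input with all noncharacters dropped.
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (!isNonCharacter(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    // Cleans every String value; other value types are copied unchanged.
    static Map<String, Object> stripAll(Map<String, Object> fields) {
        Map<String, Object> cleaned = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : fields.entrySet()) {
            Object v = e.getValue();
            cleaned.put(e.getKey(), v instanceof String ? strip((String) v) : v);
        }
        return cleaned;
    }
}
```

Note that this sketch removes only Unicode noncharacters; control
characters that an XML parser may also reject are left untouched.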
Thanks again for your help.
On Monday 25 January 2016 03:35 PM, Markus Jelsma wrote:
Hi - this is NUTCH-1016, which was never ported to 2.x.
https://issues.apache.org/jira/browse/NUTCH-1016
-----Original message-----
From:Kshitij Shukla <[email protected]>
Sent: Monday 25th January 2016 8:23
To: [email protected]
Subject: [CIS-CMMI-3] Invalid UTF-8 character 0xffff at char exception
Hello everyone,
During a very large crawl, indexing to Solr fails with the following
exception:
**************************************************
root@cism479:/usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin#
/usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin/nutch
index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -D
solr.server.url=http://localhost:8983/solr/ddcds -all -crawlId 1
16/01/25 11:44:52 INFO indexer.IndexingJob: IndexingJob: starting
16/01/25 11:44:53 INFO Configuration.deprecation:
mapred.output.key.comparator.class is deprecated. Instead, use
mapreduce.job.output.key.comparator.class
16/01/25 11:44:53 INFO plugin.PluginRepository: Plugins: looking in:
/tmp/hadoop-root/hadoop-unjar4772724649160367470/classes/plugins
16/01/25 11:44:54 INFO plugin.PluginRepository: Plugin Auto-activation
mode: [true]
16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Plugins:
16/01/25 11:44:54 INFO plugin.PluginRepository: HTTP Framework
(lib-http)
16/01/25 11:44:54 INFO plugin.PluginRepository: Html Parse Plug-in
(parse-html)
16/01/25 11:44:54 INFO plugin.PluginRepository: MetaTags
(parse-metatags)
16/01/25 11:44:54 INFO plugin.PluginRepository: Html Indexing Filter
(index-html)
16/01/25 11:44:54 INFO plugin.PluginRepository: the nutch core
extension points (nutch-extensionpoints)
16/01/25 11:44:54 INFO plugin.PluginRepository: Basic Indexing
Filter (index-basic)
16/01/25 11:44:54 INFO plugin.PluginRepository: XML Libraries (lib-xml)
16/01/25 11:44:54 INFO plugin.PluginRepository: Anchor Indexing
Filter (index-anchor)
16/01/25 11:44:54 INFO plugin.PluginRepository: Basic URL Normalizer
(urlnormalizer-basic)
16/01/25 11:44:54 INFO plugin.PluginRepository: Language
Identification Parser/Filter (language-identifier)
16/01/25 11:44:54 INFO plugin.PluginRepository: Metadata Indexing
Filter (index-metadata)
16/01/25 11:44:54 INFO plugin.PluginRepository: CyberNeko HTML
Parser (lib-nekohtml)
16/01/25 11:44:54 INFO plugin.PluginRepository: Subcollection
indexing and query filter (subcollection)
16/01/25 11:44:54 INFO plugin.PluginRepository: SOLRIndexWriter
(indexer-solr)
16/01/25 11:44:54 INFO plugin.PluginRepository: Rel-Tag microformat
Parser/Indexer/Querier (microformats-reltag)
16/01/25 11:44:54 INFO plugin.PluginRepository: Http / Https
Protocol Plug-in (protocol-httpclient)
16/01/25 11:44:54 INFO plugin.PluginRepository: JavaScript Parser
(parse-js)
16/01/25 11:44:54 INFO plugin.PluginRepository: Tika Parser Plug-in
(parse-tika)
16/01/25 11:44:54 INFO plugin.PluginRepository: Top Level Domain
Plugin (tld)
16/01/25 11:44:54 INFO plugin.PluginRepository: Regex URL Filter
Framework (lib-regex-filter)
16/01/25 11:44:54 INFO plugin.PluginRepository: Regex URL Normalizer
(urlnormalizer-regex)
16/01/25 11:44:54 INFO plugin.PluginRepository: Link Analysis
Scoring Plug-in (scoring-link)
16/01/25 11:44:54 INFO plugin.PluginRepository: OPIC Scoring Plug-in
(scoring-opic)
16/01/25 11:44:54 INFO plugin.PluginRepository: More Indexing Filter
(index-more)
16/01/25 11:44:54 INFO plugin.PluginRepository: Http Protocol
Plug-in (protocol-http)
16/01/25 11:44:54 INFO plugin.PluginRepository: Creative Commons
Plugins (creativecommons)
16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Extension-Points:
16/01/25 11:44:54 INFO plugin.PluginRepository: Parse Filter
(org.apache.nutch.parse.ParseFilter)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Index Cleaning
Filter (org.apache.nutch.indexer.IndexCleaningFilter)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Content Parser
(org.apache.nutch.parse.Parser)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch URL Filter
(org.apache.nutch.net.URLFilter)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch URL Normalizer
(org.apache.nutch.net.URLNormalizer)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Protocol
(org.apache.nutch.protocol.Protocol)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Index Writer
(org.apache.nutch.indexer.IndexWriter)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding
org.apache.nutch.indexer.html.HtmlIndexingFilter
16/01/25 11:44:54 INFO basic.BasicIndexingFilter: Maximum title length
for indexing set to: 100
16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding
org.apache.nutch.indexer.basic.BasicIndexingFilter
16/01/25 11:44:54 INFO anchor.AnchorIndexingFilter: Anchor deduplication
is: off
16/01/25 11:45:07 INFO mapreduce.JobSubmitter: Submitting tokens for
job: job_1453472314066_0007
16/01/25 11:45:08 INFO impl.YarnClientImpl: Submitted application
application_1453472314066_0007
16/01/25 11:45:09 INFO mapreduce.Job: The url to track the job:
http://cism479:8088/proxy/application_1453472314066_0007/
16/01/25 11:45:09 INFO mapreduce.Job: Running job: job_1453472314066_0007
16/01/25 11:45:29 INFO mapreduce.Job: Job job_1453472314066_0007 running
in uber mode : false
16/01/25 11:45:29 INFO mapreduce.Job: map 0% reduce 0%
16/01/25 11:49:24 INFO mapreduce.Job: map 50% reduce 0%
16/01/25 11:49:29 INFO mapreduce.Job: map 0% reduce 0%
16/01/25 11:49:29 INFO mapreduce.Job: Task Id :
attempt_1453472314066_0007_m_000000_0, Status : FAILED
Error:
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
[com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
char #1296459, byte #1310719)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at
org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
at
org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
at
org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at
org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
at
org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
at
org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
16/01/25 11:52:27 INFO mapreduce.Job: map 50% reduce 0%
16/01/25 11:53:01 INFO mapreduce.Job: map 100% reduce 0%
16/01/25 11:53:01 INFO mapreduce.Job: Task Id :
attempt_1453472314066_0007_m_000000_1, Status : FAILED
Error:
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
[com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
char #1296459, byte #1310719)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at
org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
at
org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
at
org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at
org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
at
org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
at
org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
16/01/25 11:53:02 INFO mapreduce.Job: map 50% reduce 0%
16/01/25 11:54:52 INFO mapreduce.Job: map 100% reduce 0%
16/01/25 11:54:52 INFO mapreduce.Job: Task Id :
attempt_1453472314066_0007_m_000000_2, Status : FAILED
Error:
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
[com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at
char #1296459, byte #1310719)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at
org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
at
org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
at
org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at
org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
at
org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
at
org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
16/01/25 11:54:53 INFO mapreduce.Job: map 50% reduce 0%
16/01/25 11:56:22 INFO mapreduce.Job: map 100% reduce 0%
16/01/25 11:56:23 INFO mapreduce.Job: Job job_1453472314066_0007 failed
with state FAILED due to: Task failed task_1453472314066_0007_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
16/01/25 11:56:23 INFO mapreduce.Job: Counters: 33
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=116194
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1033
HDFS: Number of bytes written=0
HDFS: Number of read operations=1
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Failed map tasks=4
Launched map tasks=5
Other local map tasks=3
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=3168342
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=1056114
Total vcore-seconds taken by all map tasks=1056114
Total megabyte-seconds taken by all map tasks=3244382208
Map-Reduce Framework
Map input records=2762511
Map output records=17629
Input split bytes=1033
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=2995
CPU time spent (ms)=116860
Physical memory (bytes) snapshot=1272868864
Virtual memory (bytes) snapshot=5104431104
Total committed heap usage (bytes)=1017118720
IndexerJob
DocumentCount=17629
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
16/01/25 11:56:23 ERROR indexer.IndexingJob: SolrIndexerJob:
java.lang.RuntimeException: job failed: name=[1]Indexer,
jobid=job_1453472314066_0007
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
*******************************************************
--
Please let me know if you have any questions , concerns or updates.
Have a great day ahead :)
Thanks and Regards,
Kshitij Shukla
Software developer
*Cyber Infrastructure(CIS)
**/The RightSourcing Specialists with 1250 man years of experience!/*
DISCLAIMER: INFORMATION PRIVACY is important for us, If you are not the
intended recipient, you should delete this message and are notified that
any disclosure, copying or distribution of this message, or taking any
action based on it, is strictly prohibited by Law.
Please don't print this e-mail unless you really need to.
--
------------------------------
*Cyber Infrastructure (P) Limited, [CIS] **(CMMI Level 3 Certified)*
Central India's largest Technology company.
*Ensuring the success of our clients and partners through our highly
optimized Technology solutions.*
www.cisin.com | +Cisin <https://plus.google.com/+Cisin/> | Linkedin
<https://www.linkedin.com/company/cyber-infrastructure-private-limited> |
Offices: *Indore, India.* *Singapore. Silicon Valley, USA*.
DISCLAIMER: INFORMATION PRIVACY is important for us, If you are not the
intended recipient, you should delete this message and are notified that
any disclosure, copying or distribution of this message, or taking any
action based on it, is strictly prohibited by Law.