Hello everyone,

During a very large crawl, the indexing step to Solr fails with the following exception:

**************************************************
root@cism479:/usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin# /usr/share/searchEngine/apache-nutch-2.3.1/runtime/deploy/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://localhost:8983/solr/ddcds -all -crawlId 1
16/01/25 11:44:52 INFO indexer.IndexingJob: IndexingJob: starting
16/01/25 11:44:53 INFO Configuration.deprecation: mapred.output.key.comparator.class is deprecated. Instead, use mapreduce.job.output.key.comparator.class
16/01/25 11:44:53 INFO plugin.PluginRepository: Plugins: looking in: /tmp/hadoop-root/hadoop-unjar4772724649160367470/classes/plugins
16/01/25 11:44:54 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Plugins:
16/01/25 11:44:54 INFO plugin.PluginRepository: HTTP Framework (lib-http)
16/01/25 11:44:54 INFO plugin.PluginRepository: Html Parse Plug-in (parse-html)
16/01/25 11:44:54 INFO plugin.PluginRepository: MetaTags (parse-metatags)
16/01/25 11:44:54 INFO plugin.PluginRepository: Html Indexing Filter (index-html)
16/01/25 11:44:54 INFO plugin.PluginRepository: the nutch core extension points (nutch-extensionpoints)
16/01/25 11:44:54 INFO plugin.PluginRepository: Basic Indexing Filter (index-basic)
16/01/25 11:44:54 INFO plugin.PluginRepository:     XML Libraries (lib-xml)
16/01/25 11:44:54 INFO plugin.PluginRepository: Anchor Indexing Filter (index-anchor)
16/01/25 11:44:54 INFO plugin.PluginRepository: Basic URL Normalizer (urlnormalizer-basic)
16/01/25 11:44:54 INFO plugin.PluginRepository: Language Identification Parser/Filter (language-identifier)
16/01/25 11:44:54 INFO plugin.PluginRepository: Metadata Indexing Filter (index-metadata)
16/01/25 11:44:54 INFO plugin.PluginRepository: CyberNeko HTML Parser (lib-nekohtml)
16/01/25 11:44:54 INFO plugin.PluginRepository: Subcollection indexing and query filter (subcollection)
16/01/25 11:44:54 INFO plugin.PluginRepository: SOLRIndexWriter (indexer-solr)
16/01/25 11:44:54 INFO plugin.PluginRepository: Rel-Tag microformat Parser/Indexer/Querier (microformats-reltag)
16/01/25 11:44:54 INFO plugin.PluginRepository: Http / Https Protocol Plug-in (protocol-httpclient)
16/01/25 11:44:54 INFO plugin.PluginRepository: JavaScript Parser (parse-js)
16/01/25 11:44:54 INFO plugin.PluginRepository: Tika Parser Plug-in (parse-tika)
16/01/25 11:44:54 INFO plugin.PluginRepository: Top Level Domain Plugin (tld)
16/01/25 11:44:54 INFO plugin.PluginRepository: Regex URL Filter Framework (lib-regex-filter)
16/01/25 11:44:54 INFO plugin.PluginRepository: Regex URL Normalizer (urlnormalizer-regex)
16/01/25 11:44:54 INFO plugin.PluginRepository: Link Analysis Scoring Plug-in (scoring-link)
16/01/25 11:44:54 INFO plugin.PluginRepository: OPIC Scoring Plug-in (scoring-opic)
16/01/25 11:44:54 INFO plugin.PluginRepository: More Indexing Filter (index-more)
16/01/25 11:44:54 INFO plugin.PluginRepository: Http Protocol Plug-in (protocol-http)
16/01/25 11:44:54 INFO plugin.PluginRepository: Creative Commons Plugins (creativecommons)
16/01/25 11:44:54 INFO plugin.PluginRepository: Registered Extension-Points:
16/01/25 11:44:54 INFO plugin.PluginRepository: Parse Filter (org.apache.nutch.parse.ParseFilter)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Index Cleaning Filter (org.apache.nutch.indexer.IndexCleaningFilter)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch URL Filter (org.apache.nutch.net.URLFilter)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Protocol (org.apache.nutch.protocol.Protocol)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Index Writer (org.apache.nutch.indexer.IndexWriter)
16/01/25 11:44:54 INFO plugin.PluginRepository: Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding org.apache.nutch.indexer.html.HtmlIndexingFilter
16/01/25 11:44:54 INFO basic.BasicIndexingFilter: Maximum title length for indexing set to: 100
16/01/25 11:44:54 INFO indexer.IndexingFilters: Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
16/01/25 11:44:54 INFO anchor.AnchorIndexingFilter: Anchor deduplication is: off
16/01/25 11:45:07 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1453472314066_0007
16/01/25 11:45:08 INFO impl.YarnClientImpl: Submitted application application_1453472314066_0007
16/01/25 11:45:09 INFO mapreduce.Job: The url to track the job: http://cism479:8088/proxy/application_1453472314066_0007/
16/01/25 11:45:09 INFO mapreduce.Job: Running job: job_1453472314066_0007
16/01/25 11:45:29 INFO mapreduce.Job: Job job_1453472314066_0007 running in uber mode : false
16/01/25 11:45:29 INFO mapreduce.Job:  map 0% reduce 0%
16/01/25 11:49:24 INFO mapreduce.Job:  map 50% reduce 0%
16/01/25 11:49:29 INFO mapreduce.Job:  map 0% reduce 0%
16/01/25 11:49:29 INFO mapreduce.Job: Task Id : attempt_1453472314066_0007_m_000000_0, Status : FAILED
Error: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at char #1296459, byte #1310719)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
    at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
    at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
    at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
    at org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
    at org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

16/01/25 11:52:27 INFO mapreduce.Job:  map 50% reduce 0%
16/01/25 11:53:01 INFO mapreduce.Job:  map 100% reduce 0%
16/01/25 11:53:01 INFO mapreduce.Job: Task Id : attempt_1453472314066_0007_m_000000_1, Status : FAILED
Error: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at char #1296459, byte #1310719)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
    at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
    at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
    at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
    at org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
    at org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

16/01/25 11:53:02 INFO mapreduce.Job:  map 50% reduce 0%
16/01/25 11:54:52 INFO mapreduce.Job:  map 100% reduce 0%
16/01/25 11:54:52 INFO mapreduce.Job: Task Id : attempt_1453472314066_0007_m_000000_2, Status : FAILED
Error: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [com.ctc.wstx.exc.WstxLazyException] Invalid UTF-8 character 0xffff at char #1296459, byte #1310719)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:84)
    at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:84)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:48)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:43)
    at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:635)
    at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
    at org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:120)
    at org.apache.nutch.indexer.IndexingJob$IndexerMapper.map(IndexingJob.java:69)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

16/01/25 11:54:53 INFO mapreduce.Job:  map 50% reduce 0%
16/01/25 11:56:22 INFO mapreduce.Job:  map 100% reduce 0%
16/01/25 11:56:23 INFO mapreduce.Job: Job job_1453472314066_0007 failed with state FAILED due to: Task failed task_1453472314066_0007_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0

16/01/25 11:56:23 INFO mapreduce.Job: Counters: 33
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=116194
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=1033
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=1
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters
        Failed map tasks=4
        Launched map tasks=5
        Other local map tasks=3
        Data-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=3168342
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=1056114
        Total vcore-seconds taken by all map tasks=1056114
        Total megabyte-seconds taken by all map tasks=3244382208
    Map-Reduce Framework
        Map input records=2762511
        Map output records=17629
        Input split bytes=1033
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=2995
        CPU time spent (ms)=116860
        Physical memory (bytes) snapshot=1272868864
        Virtual memory (bytes) snapshot=5104431104
        Total committed heap usage (bytes)=1017118720
    IndexerJob
        DocumentCount=17629
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=0
16/01/25 11:56:23 ERROR indexer.IndexingJob: SolrIndexerJob: java.lang.RuntimeException: job failed: name=[1]Indexer, jobid=job_1453472314066_0007
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
*******************************************************
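
The message names 0xffff, which is U+FFFF, a code point outside the XML 1.0 Char range; that would explain why the XML parser on the Solr side (Woodstox, per the WstxLazyException) rejects the update. As a rough sketch only, assuming the character comes from crawled page text that reaches a document field unchanged, a filter like the one below could be applied to field values before they are sent. XmlSafeText and stripInvalidXmlChars are hypothetical names used for illustration; they are not part of Nutch or SolrJ.

```java
// Hypothetical helper for illustration; not part of Nutch or SolrJ.
final class XmlSafeText {

    private XmlSafeText() {}

    // True if the code point is allowed by the XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the input with every code point that XML 1.0
    // forbids (including U+FFFE, U+FFFF and unpaired surrogates) removed.
    static String stripInvalidXmlChars(String input) {
        if (input == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            int cp = input.codePointAt(i);
            if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars depending on the code point width.
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Calling XmlSafeText.stripInvalidXmlChars on each string field value before the document is added would drop U+FFFF while leaving ordinary text, tabs and newlines untouched. Where exactly such a call belongs in the indexing path is not something the log above shows.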
--

Please let me know if you have any questions, concerns or updates.
Have a great day ahead :)

Thanks and Regards,

Kshitij Shukla
Software developer

Cyber Infrastructure (CIS)
The RightSourcing Specialists with 1250 man years of experience!

DISCLAIMER: INFORMATION PRIVACY is important for us, If you are not the intended recipient, you should delete this message and are notified that any disclosure, copying or distribution of this message, or taking any action based on it, is strictly prohibited by Law.

Please don't print this e-mail unless you really need to.

