It failed again with HSQLDB 2.2.8 after two hours of crawling. Should I go back to Nutch 1.5 or 1.6? There seem to be too many issues in Nutch 2.1. What a pity.
console:
Skipping http://blog.sina.com.cn/s/blog_blog_557f024c010.html; different batch id (null)
Exception in thread "main" java.lang.RuntimeException: job failed: name=parse, jobid=job_local_0008
        at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
        at org.apache.nutch.parse.ParserJob.run(ParserJob.java:251)
        at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
        at org.apache.nutch.crawl.Crawler.run(Crawler.java:171)
        at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)

hadoop.log:
2013-01-04 02:42:53,292 INFO parse.ParserJob - Skipping http://blog.sina.com.cn/s/blog_70b99cd80102ebqv.html; different batch id (null)
2013-01-04 02:43:07,412 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-01-04 02:43:07,436 WARN mapred.LocalJobRunner - job_local_0008
java.io.IOException: java.sql.BatchUpdateException: data exception: string data, right truncation
        at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:340)
        at org.apache.gora.sql.store.SqlStore.close(SqlStore.java:185)
        at org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:55)
        at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:651)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.sql.BatchUpdateException: data exception: string data, right truncation
        at org.hsqldb.jdbc.JDBCPreparedStatement.executeBatch(Unknown Source)
        at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:328)
        ... 6 more

At 2013-01-03 21:52:35,"Renato Marroquín Mogrovejo" <[email protected]> wrote:
>Hi Rui,
>
>The way this works is that Nutch uses the gora-sql-mapping.xml file to
>create automatically the necessary tables and then use them.
>Anyways, IMHO I think you are hitting [1] which means you could try changing
>the gora-sql-mapping.xml file to what has been discussed on JIRA and
>then let us know so we can narrow it down.
>Thanks!
>
>
>Renato M.
>
>[1] https://issues.apache.org/jira/browse/GORA-24
>
>2013/1/3 高睿 <[email protected]>:
>> BTW, could you please share me the schema of webpage table or creation
>> script?
>> It seems the table auto-generated by nutch2.1 have problems.
>>
>>
>> At 2013-01-03 21:43:26,"高睿" <[email protected]> wrote:
>>
>> I'm using this command:
>> bin/nutch crawl urls -solr http://localhost:8080/solr/collection2 -threads 10 -depth 2 -topN 1000
>> I guess the exception occurs when it try to store webpage into HSql. I tried
>> to increase the column size, but it fails again. Here's the schema for HSql:
>> sql> \d webpage
>> NAME               DATATYPE  WIDTH     NO-NULLS  PRECISION  SCALE
>> -----------------  --------  --------  --------  ---------  -----
>> ID                 VARCHAR   767       *         767
>> HEADERS            BLOB      16777216            16777216
>> TEXT               VARCHAR   16777216            16777216
>> STATUS             INTEGER   11                  32
>> MARKERS            BLOB      16777216            16777216
>> PARSESTATUS        BLOB      16777216            16777216
>> MODIFIEDTIME       BIGINT    20                  64
>> SCORE              DOUBLE    23                  64
>> TYP                VARCHAR   32                  32
>> BASEURL            VARCHAR   767                 767
>> CONTENT            BLOB      16777216            16777216
>> TITLE              VARCHAR   2048                2048
>> REPRURL            VARCHAR   767                 767
>> FETCHINTERVAL      INTEGER   11                  32
>> PREVFETCHTIME      BIGINT    20                  64
>> INLINKS            BLOB      16777216            16777216
>> PREVSIGNATURE      BLOB      16777216            16777216
>> OUTLINKS           BLOB      16777216            16777216
>> FETCHTIME          BIGINT    20                  64
>> RETRIESSINCEFETCH  INTEGER   11                  32
>> PROTOCOLSTATUS     BLOB      16777216            16777216
>> SIGNATURE          BLOB      16777216            16777216
>> METADATA           BLOB      16777216            16777216
>>
>>
>> At 2013-01-03 21:06:04,"Lewis John Mcgibbney" <[email protected]> wrote:
>>>Hi Rui,
>>>
>>>The gora-sql backend is not stable so please do not be surprised if things
>>>do not work flawlessly.
>>>
>>>I would urge you to have a look at the gora-sql-mapping.xml file [0] and
>>>check the respective field values for the columns you are attempting to map.
>>>
>>>This aside, I would use the following SQL Store implementations if I were
>>>going to use this backend
>>>
>>>HSQLDB - 2.2.8
>>>MySQL - 5.1.18
>>>
>>>Which stage (in your Nutch processes) does this Exception occur?
>>>
>>>Lewis
>>>
>>>[0]
>>>http://svn.apache.org/repos/asf/nutch/branches/2.x/conf/gora-sql-mapping.xml
>>>
>>>On Thu, Jan 3, 2013 at 9:34 AM, 高睿 <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I can't run Nutch 2.1 with Mysql, then I tried Hsql, failed again. So,
>>>> which database are you using for nutch 2.1. I spent too much time on this
>>>> and can not make it work.
>>>>
>>>> 2013-01-03 16:12:06,812 WARN mapred.FileOutputCommitter - Output path is null in cleanup
>>>> 2013-01-03 16:12:06,835 WARN mapred.LocalJobRunner - job_local_0008
>>>> java.io.IOException: java.sql.BatchUpdateException: data exception: string data, right truncation
>>>>         at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:340)
>>>>         at org.apache.gora.sql.store.SqlStore.close(SqlStore.java:185)
>>>>         at org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:55)
>>>>         at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:651)
>>>>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
>>>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>>>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>>>> Caused by: java.sql.BatchUpdateException: data exception: string data, right truncation
>>>>         at org.hsqldb.jdbc.JDBCPreparedStatement.executeBatch(Unknown Source)
>>>>         at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:328)
>>>>         ... 6 more
>>>>
>>>> Regards,
>>>> Rui
>>>>
>>>
>>>
>>>--
>>>*Lewis*
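
For anyone reading this thread with the same "string data, right truncation" error: it generally means a value was longer than the column it was being written into. Below is a minimal sketch, not anything from Nutch or Gora, that checks string values against the VARCHAR widths shown in the `\d webpage` output quoted above. The `find_overflows` helper, the dict-shaped row, and the sample key are all hypothetical, and the check assumes widths are counted in characters. It covers only the VARCHAR columns, so the column that actually fails in a given crawl may be a different one.

```python
# VARCHAR widths copied from the `\d webpage` output quoted in this thread.
VARCHAR_WIDTHS = {
    "ID": 767,
    "TEXT": 16777216,
    "TYP": 32,
    "BASEURL": 767,
    "TITLE": 2048,
    "REPRURL": 767,
}


def find_overflows(row, widths=VARCHAR_WIDTHS):
    """Return (column, actual_length, limit) for each value longer than its column."""
    overflows = []
    for column, limit in widths.items():
        value = row.get(column)
        # Columns missing from the row (or set to None) are skipped.
        if value is not None and len(value) > limit:
            overflows.append((column, len(value), limit))
    return overflows


# Hypothetical row: a page whose key is longer than the 767-character ID column.
sample_row = {
    "ID": "cn.com.sina.blog:http/" + "x" * 800,
    "TITLE": "a short title",
    "TYP": "text/html",
}

for column, actual, limit in find_overflows(sample_row):
    print(f"{column}: {actual} characters, column allows {limit}")
```

Running a check like this over values pulled from the failing batch would show whether a width in gora-sql-mapping.xml needs to be raised, which is the kind of change suggested earlier in the thread.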

