I am giving Nutch 2.0 MySQL a go also. I am running into similar issues too.
I changed inlinks on both host and webpage tables to longblob. I also modified the webpage.title to because 512 was not cutting it. I made couple of code changes also. I noticed that my updatedb jobs kept on failing because one of the page had a really long id. It turns out that id contained body text also. To handle this, I skipped pages with id longer than 512 in DbUpdateReducer.reduce. Every page skipped had a malformed id. My current schema is posted below and its running pretty stable for me. I have only gone through few hundred thousand pages, but I seem to have came across all the funny ones. Cheers, Paul ----- my current nutch 2.0 mysql schema CREATE TABLE `host` ( `id` varchar(767) CHARACTER SET latin1 NOT NULL DEFAULT '', `inlinks` longblob, `outlinks` longblob, `metadata` longblob, PRIMARY KEY (`id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8; CREATE TABLE `webpage` ( `id` varchar(767) CHARACTER SET latin1 NOT NULL DEFAULT '', `headers` blob, `text` mediumtext, `status` int(11) DEFAULT NULL, `markers` blob, `parseStatus` blob, `modifiedTime` bigint(20) DEFAULT NULL, `score` float DEFAULT NULL, `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL, `baseUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL, `content` mediumblob, `title` varchar(2048) DEFAULT NULL, `reprUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL, `fetchInterval` int(11) DEFAULT NULL, `prevFetchTime` bigint(20) DEFAULT NULL, `inlinks` longblob, `prevSignature` blob, `outlinks` blob, `fetchTime` bigint(20) DEFAULT NULL, `retriesSinceFetch` int(11) DEFAULT NULL, `protocolStatus` blob, `signature` blob, `metadata` blob, PRIMARY KEY (`id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8; On Sat, Sep 1, 2012 at 4:25 PM, <[email protected]> wrote: > I am using `content` mediumblob in the SQL table and have not had any > issues so far but that means little as it is probably due to differences > in what we are crawling. However, I am running into a similar problem > with the outlinks column so I suspect the original webpage table > creation script for MySQL has never been tested with real production > runs. I am going to play around with that and see if I can get > something more robust this week. > > -----Original Message----- > From: Matt MacDonald [mailto:[email protected]] > Sent: Friday, August 31, 2012 9:38 AM > To: [email protected] > Subject: Re: Nutch 2.0 MySQL Data truncation: Data too long for column > 'content' at row 1 > > Yeah I had tried using "MEDIUMBLOB" and ran into the same error message. > > <field name="content" column="content" jdbc-type="MEDIUMBLOB"/> > > It seems likely that we'd want to use HBase in production anyway so I > ended up switching the configuration over to HBase for the crawl and > didn't run into any issues with the content length of the documents. If > anyone knows more about what the mysql issue it might be worth replying > so that others that are encountering the issue don't hit the same dead > end that I did. > > Thanks, > Matt > > On Thu, Aug 30, 2012 at 4:05 PM, Ferdy Galema > <[email protected]>wrote: > > > Hi, > > > > It might be too obvious but have you already tried to play around with > > > the following line in the mapping (seems like a limit to me)? > > <field name="content" column="content" length="65535"/> > > > > It has been a while since I tried to store big content with the > > SqlStore, so I'm not sure how it works exactly. > > > > Ferdy. > > > > On Thu, Aug 30, 2012 at 2:20 PM, Matt MacDonald <[email protected]> > > wrote: > > > > > Hi, > > > > > > I'm using Nutch 2.0 from the 2.x branch on Github and used > > > http://nlp.solutions.asia/?p=180 to configure Nutch to use MySQL as > > > the storage backend. I'm seeing the following error show up in my > > > hadoop.log file while fetching and wonder if others have ideas for > > > moving past the error without having to set the http.content.limit: > > > > > > java.io.IOException: java.sql.BatchUpdateException: Data truncation: > > > > Data too long for column 'content' at row 1 > > > at > org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:340) > > > at > org.apache.gora.sql.store.SqlStore.close(SqlStore.java:185) > > > at > > > > > org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java > > :55) > > > at > > > > > > > > org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(Redu > > ceTask.java:579) > > > at > > > > org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:650) > > > at > org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417) > > > at > > > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java: > > > 260) Caused by: java.sql.BatchUpdateException: Data truncation: Data > > > > too long for column 'content' at row 1 > > > at > > > > > > > > com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatemen > > t.java:2028) > > > at > > > > > com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1 > > 451) > > > at > > > org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:328) > > > > > > > > > The webpage table in MySQL looks like: > > > > > > mysql> desc webpage; > > > +-------------------+--------------+------+-----+---------+-------+ > > > | Field | Type | Null | Key | Default | Extra | > > > +-------------------+--------------+------+-----+---------+-------+ > > > | id | varchar(512) | NO | PRI | NULL | | > > > | headers | blob | YES | | NULL | | > > > | text | mediumtext | YES | | NULL | | > > > | status | int(11) | YES | | NULL | | > > > | markers | blob | YES | | NULL | | > > > | parseStatus | blob | YES | | NULL | | > > > | modifiedTime | bigint(20) | YES | | NULL | | > > > | score | float | YES | | NULL | | > > > | typ | varchar(32) | YES | | NULL | | > > > | baseUrl | varchar(512) | YES | | NULL | | > > > | content | mediumblob | YES | | NULL | | > > > | title | varchar(512) | YES | | NULL | | > > > | reprUrl | varchar(512) | YES | | NULL | | > > > | fetchInterval | int(11) | YES | | NULL | | > > > | prevFetchTime | bigint(20) | YES | | NULL | | > > > | inlinks | blob | YES | | NULL | | > > > | prevSignature | blob | YES | | NULL | | > > > | outlinks | blob | YES | | NULL | | > > > | fetchTime | bigint(20) | YES | | NULL | | > > > | retriesSinceFetch | int(11) | YES | | NULL | | > > > | protocolStatus | blob | YES | | NULL | | > > > | signature | blob | YES | | NULL | | > > > | metadata | blob | YES | | NULL | | > > > +-------------------+--------------+------+-----+---------+-------+ > > > 23 rows in set (0.00 sec) > > > > > > > > > > > > And my gora-sql-mapping.xml file looks like: > > > > > > <class name="org.apache.nutch.storage.WebPage" > > keyClass="java.lang.String" > > > table="webpage"> > > > <primarykey column="id" length="512"/> > > > <field name="baseUrl" column="baseUrl" length="512"/> > > > <field name="status" column="status"/> > > > <field name="prevFetchTime" column="prevFetchTime"/> > > > <field name="fetchTime" column="fetchTime"/> > > > <field name="fetchInterval" column="fetchInterval"/> > > > <field name="retriesSinceFetch" column="retriesSinceFetch"/> > > > <field name="reprUrl" column="reprUrl" length="512"/> > > > <field name="content" column="content" length="65535"/> > > > <field name="contentType" column="typ" length="32"/> > > > <field name="protocolStatus" column="protocolStatus"/> > > > <field name="modifiedTime" column="modifiedTime"/> > > > > > > <!-- parse fields --> > > > <field name="title" column="title" length="512"/> > > > <field name="text" column="text" length="32000"/> > > > <field name="parseStatus" column="parseStatus"/> > > > <field name="signature" column="signature"/> > > > <field name="prevSignature" column="prevSignature"/> > > > > > > <!-- score fields --> > > > <field name="score" column="score"/> > > > <field name="headers" column="headers"/> > > > <field name="inlinks" column="inlinks"/> > > > <field name="outlinks" column="outlinks"/> > > > <field name="metadata" column="metadata"/> > > > <field name="markers" column="markers"/> </class> > > > > > > > > > The http.content.limit in nutch-default.xml looks like: > > > > > > <property> > > > <name>http.content.limit</name> > > > <value>-1</value> > > > <description>The length limit for downloaded content using the > http > > > protocol, in bytes. If this value is nonnegative (>=0), content > longer > > > than it will be truncated; otherwise, no truncation at all. Do not > > > confuse this setting with the file.content.limit setting. > > > </description> > > > </property> > > > > > > > > > My my.cnf file looks like: > > > > > > [mysqld] > > > max_allowed_packet = 200M > > > > > > character-set-server = utf8 > > > collation-server=utf8_unicode_ci > > > > > > > > > I've tested that I could change the http.content.limit property to > be a > > > nonnegative number (65535) and the fetch job completes, but I want > to > > have > > > the non-truncated content available so that I'm crawling all links > on the > > > page and storing the entire contents of the document so that I can > then > > > index the entire text in Elasticsearch. Any ideas on how I can fetch > and > > > store the full content in MySQL? If the answer is - use HBase I'll > do > > that > > > I'm just trying to remove another new variable as I learn more about > how > > > Nutch works. With the content.limit set my crawl completes but I'm > > missing > > > nearly a 3rd of the documents that I would expect because the > content is > > > being truncated? > > > > > > Thanks for any advice you can offer. > > > > > > Thanks, > > > Matt > > > > > >

