Thanks Paul. I had the table change part figured out already but your comment about malformed ids and DbUpdateReducer.reduce was really helpful (and something worth its own thread). With the proper table set up there should be no problem with size with MySQL. However the URL parsing for the id has issues and non-Ascii characters cause problems in various places.
-----Original Message----- From: Paul Dhaliwal [mailto:[email protected]] Sent: Sunday, September 02, 2012 2:54 PM To: [email protected] Subject: Re: Nutch 2.0 MySQL Data truncation: Data too long for column 'content' at row 1 I am giving Nutch 2.0 MySQL a go also. I am running into similar issues too. I changed inlinks on both host and webpage tables to longblob. I also modified the webpage.title to because 512 was not cutting it. I made couple of code changes also. I noticed that my updatedb jobs kept on failing because one of the page had a really long id. It turns out that id contained body text also. To handle this, I skipped pages with id longer than 512 in DbUpdateReducer.reduce. Every page skipped had a malformed id. My current schema is posted below and its running pretty stable for me. I have only gone through few hundred thousand pages, but I seem to have came across all the funny ones. Cheers, Paul ----- my current nutch 2.0 mysql schema CREATE TABLE `host` ( `id` varchar(767) CHARACTER SET latin1 NOT NULL DEFAULT '', `inlinks` longblob, `outlinks` longblob, `metadata` longblob, PRIMARY KEY (`id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8; CREATE TABLE `webpage` ( `id` varchar(767) CHARACTER SET latin1 NOT NULL DEFAULT '', `headers` blob, `text` mediumtext, `status` int(11) DEFAULT NULL, `markers` blob, `parseStatus` blob, `modifiedTime` bigint(20) DEFAULT NULL, `score` float DEFAULT NULL, `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL, `baseUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL, `content` mediumblob, `title` varchar(2048) DEFAULT NULL, `reprUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL, `fetchInterval` int(11) DEFAULT NULL, `prevFetchTime` bigint(20) DEFAULT NULL, `inlinks` longblob, `prevSignature` blob, `outlinks` blob, `fetchTime` bigint(20) DEFAULT NULL, `retriesSinceFetch` int(11) DEFAULT NULL, `protocolStatus` blob, `signature` blob, `metadata` blob, PRIMARY KEY (`id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8; On Sat, Sep 1, 2012 at 4:25 PM, <[email protected]> wrote: > I am using `content` mediumblob in the SQL table and have not had any > issues so far but that means little as it is probably due to > differences in what we are crawling. However, I am running into a > similar problem with the outlinks column so I suspect the original > webpage table creation script for MySQL has never been tested with > real production runs. I am going to play around with that and see if > I can get something more robust this week. > > -----Original Message----- > From: Matt MacDonald [mailto:[email protected]] > Sent: Friday, August 31, 2012 9:38 AM > To: [email protected] > Subject: Re: Nutch 2.0 MySQL Data truncation: Data too long for column > 'content' at row 1 > > Yeah I had tried using "MEDIUMBLOB" and ran into the same error message. > > <field name="content" column="content" jdbc-type="MEDIUMBLOB"/> > > It seems likely that we'd want to use HBase in production anyway so I > ended up switching the configuration over to HBase for the crawl and > didn't run into any issues with the content length of the documents. > If anyone knows more about what the mysql issue it might be worth > replying so that others that are encountering the issue don't hit the > same dead end that I did. > > Thanks, > Matt > > On Thu, Aug 30, 2012 at 4:05 PM, Ferdy Galema > <[email protected]>wrote: > > > Hi, > > > > It might be too obvious but have you already tried to play around > > with > > > the following line in the mapping (seems like a limit to me)? > > <field name="content" column="content" length="65535"/> > > > > It has been a while since I tried to store big content with the > > SqlStore, so I'm not sure how it works exactly. > > > > Ferdy. > > > > On Thu, Aug 30, 2012 at 2:20 PM, Matt MacDonald <[email protected]> > > wrote: > > > > > Hi, > > > > > > I'm using Nutch 2.0 from the 2.x branch on Github and used > > > http://nlp.solutions.asia/?p=180 to configure Nutch to use MySQL > > > as the storage backend. I'm seeing the following error show up in > > > my hadoop.log file while fetching and wonder if others have ideas > > > for moving past the error without having to set the http.content.limit: > > > > > > java.io.IOException: java.sql.BatchUpdateException: Data truncation: > > > > Data too long for column 'content' at row 1 > > > at > org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:340) > > > at > org.apache.gora.sql.store.SqlStore.close(SqlStore.java:185) > > > at > > > > > org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.ja > > va > > :55) > > > at > > > > > > > > org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(Re > > du > > ceTask.java:579) > > > at > > > > org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:650) > > > at > org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417) > > > at > > > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java: > > > 260) Caused by: java.sql.BatchUpdateException: Data truncation: > > > Data > > > > too long for column 'content' at row 1 > > > at > > > > > > > > com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatem > > en > > t.java:2028) > > > at > > > > > com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java > > :1 > > 451) > > > at > > > org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:328) > > > > > > > > > The webpage table in MySQL looks like: > > > > > > mysql> desc webpage; > > > +-------------------+--------------+------+-----+---------+-------+ > > > | Field | Type | Null | Key | Default | Extra | > > > +-------------------+--------------+------+-----+---------+-------+ > > > | id | varchar(512) | NO | PRI | NULL | | > > > | headers | blob | YES | | NULL | | > > > | text | mediumtext | YES | | NULL | | > > > | status | int(11) | YES | | NULL | | > > > | markers | blob | YES | | NULL | | > > > | parseStatus | blob | YES | | NULL | | > > > | modifiedTime | bigint(20) | YES | | NULL | | > > > | score | float | YES | | NULL | | > > > | typ | varchar(32) | YES | | NULL | | > > > | baseUrl | varchar(512) | YES | | NULL | | > > > | content | mediumblob | YES | | NULL | | > > > | title | varchar(512) | YES | | NULL | | > > > | reprUrl | varchar(512) | YES | | NULL | | > > > | fetchInterval | int(11) | YES | | NULL | | > > > | prevFetchTime | bigint(20) | YES | | NULL | | > > > | inlinks | blob | YES | | NULL | | > > > | prevSignature | blob | YES | | NULL | | > > > | outlinks | blob | YES | | NULL | | > > > | fetchTime | bigint(20) | YES | | NULL | | > > > | retriesSinceFetch | int(11) | YES | | NULL | | > > > | protocolStatus | blob | YES | | NULL | | > > > | signature | blob | YES | | NULL | | > > > | metadata | blob | YES | | NULL | | > > > +-------------------+--------------+------+-----+---------+-------+ > > > 23 rows in set (0.00 sec) > > > > > > > > > > > > And my gora-sql-mapping.xml file looks like: > > > > > > <class name="org.apache.nutch.storage.WebPage" > > keyClass="java.lang.String" > > > table="webpage"> > > > <primarykey column="id" length="512"/> > > > <field name="baseUrl" column="baseUrl" length="512"/> > > > <field name="status" column="status"/> > > > <field name="prevFetchTime" column="prevFetchTime"/> > > > <field name="fetchTime" column="fetchTime"/> > > > <field name="fetchInterval" column="fetchInterval"/> > > > <field name="retriesSinceFetch" column="retriesSinceFetch"/> > > > <field name="reprUrl" column="reprUrl" length="512"/> > > > <field name="content" column="content" length="65535"/> > > > <field name="contentType" column="typ" length="32"/> > > > <field name="protocolStatus" column="protocolStatus"/> > > > <field name="modifiedTime" column="modifiedTime"/> > > > > > > <!-- parse fields --> > > > <field name="title" column="title" length="512"/> > > > <field name="text" column="text" length="32000"/> > > > <field name="parseStatus" column="parseStatus"/> > > > <field name="signature" column="signature"/> > > > <field name="prevSignature" column="prevSignature"/> > > > > > > <!-- score fields --> > > > <field name="score" column="score"/> > > > <field name="headers" column="headers"/> > > > <field name="inlinks" column="inlinks"/> > > > <field name="outlinks" column="outlinks"/> > > > <field name="metadata" column="metadata"/> > > > <field name="markers" column="markers"/> </class> > > > > > > > > > The http.content.limit in nutch-default.xml looks like: > > > > > > <property> > > > <name>http.content.limit</name> > > > <value>-1</value> > > > <description>The length limit for downloaded content using the > http > > > protocol, in bytes. If this value is nonnegative (>=0), content > longer > > > than it will be truncated; otherwise, no truncation at all. Do not > > > confuse this setting with the file.content.limit setting. > > > </description> > > > </property> > > > > > > > > > My my.cnf file looks like: > > > > > > [mysqld] > > > max_allowed_packet = 200M > > > > > > character-set-server = utf8 > > > collation-server=utf8_unicode_ci > > > > > > > > > I've tested that I could change the http.content.limit property to > be a > > > nonnegative number (65535) and the fetch job completes, but I want > to > > have > > > the non-truncated content available so that I'm crawling all links > on the > > > page and storing the entire contents of the document so that I can > then > > > index the entire text in Elasticsearch. Any ideas on how I can > > > fetch > and > > > store the full content in MySQL? If the answer is - use HBase I'll > do > > that > > > I'm just trying to remove another new variable as I learn more > > > about > how > > > Nutch works. With the content.limit set my crawl completes but I'm > > missing > > > nearly a 3rd of the documents that I would expect because the > content is > > > being truncated? > > > > > > Thanks for any advice you can offer. > > > > > > Thanks, > > > Matt > > > > > >

