Hi,

This might be too obvious, but have you already tried changing the following line in the mapping? The length looks like a limit to me:

<field name="content" column="content" length="65535"/>

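As a rough sketch of what I mean, the entry could be raised to 16777215 bytes (2^24 - 1), the maximum size of a MySQL MEDIUMBLOB, which is the type the content column already has in your table. This assumes SqlStore treats the length attribute as the upper bound on the stored value, which I have not verified, and the number is only an illustration:

```xml
<!-- Assumption: SqlStore uses "length" as the maximum size for this field.
     16777215 = 2^24 - 1, the largest value a MySQL MEDIUMBLOB can hold. -->
<field name="content" column="content" length="16777215"/>
```

If that helps, the text field (length="32000") might deserve the same look.
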
It has been a while since I tried to store big content with the SqlStore, so I'm not sure how it works exactly.

Ferdy.

On Thu, Aug 30, 2012 at 2:20 PM, Matt MacDonald <[email protected]> wrote:

> Hi,
>
> I'm using Nutch 2.0 from the 2.x branch on Github and used
> http://nlp.solutions.asia/?p=180 to configure Nutch to use MySQL as the
> storage backend. I'm seeing the following error show up in my hadoop.log
> file while fetching and wonder if others have ideas for moving past the
> error without having to set the http.content.limit:
>
> java.io.IOException: java.sql.BatchUpdateException: Data truncation: Data too long for column 'content' at row 1
>     at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:340)
>     at org.apache.gora.sql.store.SqlStore.close(SqlStore.java:185)
>     at org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:55)
>     at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:579)
>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:650)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
> Caused by: java.sql.BatchUpdateException: Data truncation: Data too long for column 'content' at row 1
>     at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2028)
>     at com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1451)
>     at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:328)
>
>
> The webpage table in MySQL looks like:
>
> mysql> desc webpage;
> +-------------------+--------------+------+-----+---------+-------+
> | Field             | Type         | Null | Key | Default | Extra |
> +-------------------+--------------+------+-----+---------+-------+
> | id                | varchar(512) | NO   | PRI | NULL    |       |
> | headers           | blob         | YES  |     | NULL    |       |
> | text              | mediumtext   | YES  |     | NULL    |       |
> | status            | int(11)      | YES  |     | NULL    |       |
> | markers           | blob         | YES  |     | NULL    |       |
> | parseStatus       | blob         | YES  |     | NULL    |       |
> | modifiedTime      | bigint(20)   | YES  |     | NULL    |       |
> | score             | float        | YES  |     | NULL    |       |
> | typ               | varchar(32)  | YES  |     | NULL    |       |
> | baseUrl           | varchar(512) | YES  |     | NULL    |       |
> | content           | mediumblob   | YES  |     | NULL    |       |
> | title             | varchar(512) | YES  |     | NULL    |       |
> | reprUrl           | varchar(512) | YES  |     | NULL    |       |
> | fetchInterval     | int(11)      | YES  |     | NULL    |       |
> | prevFetchTime     | bigint(20)   | YES  |     | NULL    |       |
> | inlinks           | blob         | YES  |     | NULL    |       |
> | prevSignature     | blob         | YES  |     | NULL    |       |
> | outlinks          | blob         | YES  |     | NULL    |       |
> | fetchTime         | bigint(20)   | YES  |     | NULL    |       |
> | retriesSinceFetch | int(11)      | YES  |     | NULL    |       |
> | protocolStatus    | blob         | YES  |     | NULL    |       |
> | signature         | blob         | YES  |     | NULL    |       |
> | metadata          | blob         | YES  |     | NULL    |       |
> +-------------------+--------------+------+-----+---------+-------+
> 23 rows in set (0.00 sec)
>
>
> And my gora-sql-mapping.xml file looks like:
>
> <class name="org.apache.nutch.storage.WebPage" keyClass="java.lang.String" table="webpage">
>   <primarykey column="id" length="512"/>
>   <field name="baseUrl" column="baseUrl" length="512"/>
>   <field name="status" column="status"/>
>   <field name="prevFetchTime" column="prevFetchTime"/>
>   <field name="fetchTime" column="fetchTime"/>
>   <field name="fetchInterval" column="fetchInterval"/>
>   <field name="retriesSinceFetch" column="retriesSinceFetch"/>
>   <field name="reprUrl" column="reprUrl" length="512"/>
>   <field name="content" column="content" length="65535"/>
>   <field name="contentType" column="typ" length="32"/>
>   <field name="protocolStatus" column="protocolStatus"/>
>   <field name="modifiedTime" column="modifiedTime"/>
>
>   <!-- parse fields -->
>   <field name="title" column="title" length="512"/>
>   <field name="text" column="text" length="32000"/>
>   <field name="parseStatus" column="parseStatus"/>
>   <field name="signature" column="signature"/>
>   <field name="prevSignature" column="prevSignature"/>
>
>   <!-- score fields -->
>   <field name="score" column="score"/>
>   <field name="headers" column="headers"/>
>   <field name="inlinks" column="inlinks"/>
>   <field name="outlinks" column="outlinks"/>
>   <field name="metadata" column="metadata"/>
>   <field name="markers" column="markers"/>
> </class>
>
>
> The http.content.limit in nutch-default.xml looks like:
>
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
>   <description>The length limit for downloaded content using the http
>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>   than it will be truncated; otherwise, no truncation at all. Do not
>   confuse this setting with the file.content.limit setting.
>   </description>
> </property>
>
>
> My my.cnf file looks like:
>
> [mysqld]
> max_allowed_packet = 200M
>
> character-set-server = utf8
> collation-server=utf8_unicode_ci
>
>
> I've tested that I could change the http.content.limit property to be a
> nonnegative number (65535) and the fetch job completes, but I want to have
> the non-truncated content available so that I'm crawling all links on the
> page and storing the entire contents of the document so that I can then
> index the entire text in Elasticsearch. Any ideas on how I can fetch and
> store the full content in MySQL? If the answer is - use HBase I'll do that
> I'm just trying to remove another new variable as I learn more about how
> Nutch works. With the content.limit set my crawl completes but I'm missing
> nearly a 3rd of the documents that I would expect because the content is
> being truncated?
>
> Thanks for any advice you can offer.
>
> Thanks,
> Matt
>

