Hi,

It might be too obvious but have you already tried to play around with the
following line in the mapping (seems like a limit to me)?
<field name="content" column="content" length="65535"/>

It has been a while since I tried to store big content with the SqlStore,
so I'm not sure how it works exactly.

Ferdy.

On Thu, Aug 30, 2012 at 2:20 PM, Matt MacDonald <[email protected]> wrote:

> Hi,
>
> I'm using Nutch 2.0 from the 2.x branch on Github and used
> http://nlp.solutions.asia/?p=180 to configure Nutch to use MySQL as the
> storage backend. I'm seeing the following error show up in my hadoop.log
> file while fetching and wonder if others have ideas for moving past the
> error without having to set the http.content.limit:
>
> java.io.IOException: java.sql.BatchUpdateException: Data truncation: Data
> too long for column 'content' at row 1
>         at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:340)
>         at org.apache.gora.sql.store.SqlStore.close(SqlStore.java:185)
>         at
> org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:55)
>         at
>
> org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:579)
>         at
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:650)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
>         at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
> Caused by: java.sql.BatchUpdateException: Data truncation: Data too long
> for column 'content' at row 1
>         at
>
> com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2028)
>         at
> com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1451)
>         at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:328)
>
>
> The webpage table in MySQL looks like:
>
> mysql> desc webpage;
> +-------------------+--------------+------+-----+---------+-------+
> | Field             | Type         | Null | Key | Default | Extra |
> +-------------------+--------------+------+-----+---------+-------+
> | id                | varchar(512) | NO   | PRI | NULL    |       |
> | headers           | blob         | YES  |     | NULL    |       |
> | text              | mediumtext   | YES  |     | NULL    |       |
> | status            | int(11)      | YES  |     | NULL    |       |
> | markers           | blob         | YES  |     | NULL    |       |
> | parseStatus       | blob         | YES  |     | NULL    |       |
> | modifiedTime      | bigint(20)   | YES  |     | NULL    |       |
> | score             | float        | YES  |     | NULL    |       |
> | typ               | varchar(32)  | YES  |     | NULL    |       |
> | baseUrl           | varchar(512) | YES  |     | NULL    |       |
> | content           | mediumblob   | YES  |     | NULL    |       |
> | title             | varchar(512) | YES  |     | NULL    |       |
> | reprUrl           | varchar(512) | YES  |     | NULL    |       |
> | fetchInterval     | int(11)      | YES  |     | NULL    |       |
> | prevFetchTime     | bigint(20)   | YES  |     | NULL    |       |
> | inlinks           | blob         | YES  |     | NULL    |       |
> | prevSignature     | blob         | YES  |     | NULL    |       |
> | outlinks          | blob         | YES  |     | NULL    |       |
> | fetchTime         | bigint(20)   | YES  |     | NULL    |       |
> | retriesSinceFetch | int(11)      | YES  |     | NULL    |       |
> | protocolStatus    | blob         | YES  |     | NULL    |       |
> | signature         | blob         | YES  |     | NULL    |       |
> | metadata          | blob         | YES  |     | NULL    |       |
> +-------------------+--------------+------+-----+---------+-------+
> 23 rows in set (0.00 sec)
>
>
>
> And my gora-sql-mapping.xml file looks like:
>
> <class name="org.apache.nutch.storage.WebPage" keyClass="java.lang.String"
> table="webpage">
>   <primarykey column="id" length="512"/>
>     <field name="baseUrl" column="baseUrl" length="512"/>
>     <field name="status" column="status"/>
>     <field name="prevFetchTime" column="prevFetchTime"/>
>     <field name="fetchTime" column="fetchTime"/>
>     <field name="fetchInterval" column="fetchInterval"/>
>     <field name="retriesSinceFetch" column="retriesSinceFetch"/>
>     <field name="reprUrl" column="reprUrl" length="512"/>
>     <field name="content" column="content" length="65535"/>
>     <field name="contentType" column="typ" length="32"/>
>     <field name="protocolStatus" column="protocolStatus"/>
>     <field name="modifiedTime" column="modifiedTime"/>
>
>     <!-- parse fields                                       -->
>     <field name="title" column="title" length="512"/>
>     <field name="text" column="text" length="32000"/>
>     <field name="parseStatus" column="parseStatus"/>
>     <field name="signature" column="signature"/>
>     <field name="prevSignature" column="prevSignature"/>
>
>     <!-- score fields                                       -->
>     <field name="score" column="score"/>
>     <field name="headers" column="headers"/>
>     <field name="inlinks" column="inlinks"/>
>     <field name="outlinks" column="outlinks"/>
>     <field name="metadata" column="metadata"/>
>     <field name="markers" column="markers"/>
> </class>
>
>
> The http.content.limit in nutch-default.xml looks like:
>
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
>   <description>The length limit for downloaded content using the http
>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>   than it will be truncated; otherwise, no truncation at all. Do not
>   confuse this setting with the file.content.limit setting.
>   </description>
> </property>
>
>
> My my.cnf file looks like:
>
> [mysqld]
> max_allowed_packet             = 200M
>
> character-set-server = utf8
> collation-server=utf8_unicode_ci
>
>
> I've tested that I could change the http.content.limit property to be a
> nonnegative number (65535) and the fetch job completes, but I want to have
> the non-truncated content available so that I'm crawling all links on the
> page and storing the entire contents of the document so that I can then
> index the entire text in Elasticsearch. Any ideas on how I can fetch and
> store the full content in MySQL? If the answer is - use HBase I'll do that
> I'm just trying to remove another new variable as I learn more about how
> Nutch works. With the content.limit set my crawl completes but I'm missing
> nearly a 3rd of the documents that I would expect because the content is
> being truncated?
>
> Thanks for any advice you can offer.
>
> Thanks,
> Matt
>

Reply via email to