Hi,

I'm using Nutch 2.0 from the 2.x branch on GitHub and followed
http://nlp.solutions.asia/?p=180 to configure Nutch to use MySQL as the
storage backend. The following error shows up in my hadoop.log file while
fetching, and I wonder if others have ideas for getting past it without
having to set http.content.limit:

java.io.IOException: java.sql.BatchUpdateException: Data truncation: Data too long for column 'content' at row 1
        at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:340)
        at org.apache.gora.sql.store.SqlStore.close(SqlStore.java:185)
        at org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:55)
        at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:579)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:650)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
Caused by: java.sql.BatchUpdateException: Data truncation: Data too long for column 'content' at row 1
        at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2028)
        at com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1451)
        at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:328)


The webpage table in MySQL looks like:

mysql> desc webpage;
+-------------------+--------------+------+-----+---------+-------+
| Field             | Type         | Null | Key | Default | Extra |
+-------------------+--------------+------+-----+---------+-------+
| id                | varchar(512) | NO   | PRI | NULL    |       |
| headers           | blob         | YES  |     | NULL    |       |
| text              | mediumtext   | YES  |     | NULL    |       |
| status            | int(11)      | YES  |     | NULL    |       |
| markers           | blob         | YES  |     | NULL    |       |
| parseStatus       | blob         | YES  |     | NULL    |       |
| modifiedTime      | bigint(20)   | YES  |     | NULL    |       |
| score             | float        | YES  |     | NULL    |       |
| typ               | varchar(32)  | YES  |     | NULL    |       |
| baseUrl           | varchar(512) | YES  |     | NULL    |       |
| content           | mediumblob   | YES  |     | NULL    |       |
| title             | varchar(512) | YES  |     | NULL    |       |
| reprUrl           | varchar(512) | YES  |     | NULL    |       |
| fetchInterval     | int(11)      | YES  |     | NULL    |       |
| prevFetchTime     | bigint(20)   | YES  |     | NULL    |       |
| inlinks           | blob         | YES  |     | NULL    |       |
| prevSignature     | blob         | YES  |     | NULL    |       |
| outlinks          | blob         | YES  |     | NULL    |       |
| fetchTime         | bigint(20)   | YES  |     | NULL    |       |
| retriesSinceFetch | int(11)      | YES  |     | NULL    |       |
| protocolStatus    | blob         | YES  |     | NULL    |       |
| signature         | blob         | YES  |     | NULL    |       |
| metadata          | blob         | YES  |     | NULL    |       |
+-------------------+--------------+------+-----+---------+-------+
23 rows in set (0.00 sec)



And my gora-sql-mapping.xml file looks like:

<class name="org.apache.nutch.storage.WebPage" keyClass="java.lang.String"
table="webpage">
  <primarykey column="id" length="512"/>
    <field name="baseUrl" column="baseUrl" length="512"/>
    <field name="status" column="status"/>
    <field name="prevFetchTime" column="prevFetchTime"/>
    <field name="fetchTime" column="fetchTime"/>
    <field name="fetchInterval" column="fetchInterval"/>
    <field name="retriesSinceFetch" column="retriesSinceFetch"/>
    <field name="reprUrl" column="reprUrl" length="512"/>
    <field name="content" column="content" length="65535"/>
    <field name="contentType" column="typ" length="32"/>
    <field name="protocolStatus" column="protocolStatus"/>
    <field name="modifiedTime" column="modifiedTime"/>

    <!-- parse fields                                       -->
    <field name="title" column="title" length="512"/>
    <field name="text" column="text" length="32000"/>
    <field name="parseStatus" column="parseStatus"/>
    <field name="signature" column="signature"/>
    <field name="prevSignature" column="prevSignature"/>

    <!-- score fields                                       -->
    <field name="score" column="score"/>
    <field name="headers" column="headers"/>
    <field name="inlinks" column="inlinks"/>
    <field name="outlinks" column="outlinks"/>
    <field name="metadata" column="metadata"/>
    <field name="markers" column="markers"/>
</class>


The http.content.limit in nutch-default.xml looks like:

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>


My my.cnf file looks like:

[mysqld]
max_allowed_packet             = 200M

character-set-server = utf8
collation-server=utf8_unicode_ci


I've confirmed that if I change the http.content.limit property to a
nonnegative number (65535), the fetch job completes. However, I want the
non-truncated content available so that I crawl all the links on each page
and store the entire document, which I can then index in full in
Elasticsearch. Any ideas on how I can fetch and store the full content in
MySQL? If the answer is "use HBase", I'll do that; I'm just trying to avoid
adding another new variable as I learn more about how Nutch works. With
http.content.limit set, my crawl completes, but I'm missing nearly a third
of the documents I would expect, presumably because the content is being
truncated.
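
For reference, the truncating workaround I tested amounts to an override
along these lines. This is only a sketch: I'm assuming it sits inside the
<configuration> element of conf/nutch-site.xml (the usual place for
overriding nutch-default.xml values), and 65535 is simply the value I
tried, which happens to equal the length on the content field in my
gora-sql-mapping.xml.

```xml
<!-- Sketch of the workaround; assumed to live in conf/nutch-site.xml -->
<property>
  <name>http.content.limit</name>
  <!-- 65535 bytes: content longer than this is truncated at fetch time -->
  <value>65535</value>
</property>
```

It lets the fetch job finish, but it is exactly the truncation I'd like to
avoid.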

Thanks for any advice you can offer.

Thanks,
Matt
