I am using a mediumblob `content` column in the SQL table and have not had
any issues so far, but that means little; it is probably down to
differences in what we are crawling.  However, I am running into a similar
problem with the outlinks column, so I suspect the original webpage table
creation script for MySQL has never been tested with real production
runs.  I am going to play around with that this week and see if I can get
something more robust.
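
As a rough, untested sketch of one thing to try here, assuming the
`webpage` table quoted further down the thread: MySQL's BLOB type holds at
most 65,535 bytes, while MEDIUMBLOB holds up to 16,777,215, so widening
the outlinks column might look like the statements below. Whether this
resolves the error is an assumption; it may not be sufficient by itself,
since Matt saw the same truncation with `content` already declared as
mediumblob.

```sql
-- Widen the outlinks column from BLOB (max 65,535 bytes)
-- to MEDIUMBLOB (max 16,777,215 bytes).
ALTER TABLE webpage MODIFY outlinks MEDIUMBLOB;

-- Confirm the new column type.
SHOW COLUMNS FROM webpage LIKE 'outlinks';
```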

-----Original Message-----
From: Matt MacDonald [mailto:[email protected]] 
Sent: Friday, August 31, 2012 9:38 AM
To: [email protected]
Subject: Re: Nutch 2.0 MySQL Data truncation: Data too long for column
'content' at row 1

Yeah I had tried using "MEDIUMBLOB" and ran into the same error message.

<field name="content" column="content" jdbc-type="MEDIUMBLOB"/>

It seems likely that we'd want to use HBase in production anyway, so I
ended up switching the configuration over to HBase for the crawl and
didn't run into any issues with the content length of the documents. If
anyone knows more about what the MySQL issue is, it might be worth
replying so that others who encounter it don't hit the same dead end
that I did.

Thanks,
Matt

On Thu, Aug 30, 2012 at 4:05 PM, Ferdy Galema
<[email protected]> wrote:

> Hi,
>
> It might be too obvious, but have you already tried to play around with
> the following line in the mapping (seems like a limit to me)?
> <field name="content" column="content" length="65535"/>
>
> It has been a while since I tried to store big content with the 
> SqlStore, so I'm not sure how it works exactly.
>
> Ferdy.
>
> On Thu, Aug 30, 2012 at 2:20 PM, Matt MacDonald <[email protected]>
> wrote:
>
> > Hi,
> >
> > I'm using Nutch 2.0 from the 2.x branch on Github and used
> > http://nlp.solutions.asia/?p=180 to configure Nutch to use MySQL as 
> > the storage backend. I'm seeing the following error show up in my 
> > hadoop.log file while fetching and wonder if others have ideas for 
> > moving past the error without having to set the http.content.limit:
> >
> > java.io.IOException: java.sql.BatchUpdateException: Data truncation: Data too long for column 'content' at row 1
> >         at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:340)
> >         at org.apache.gora.sql.store.SqlStore.close(SqlStore.java:185)
> >         at org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:55)
> >         at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:579)
> >         at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:650)
> >         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
> >         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
> > Caused by: java.sql.BatchUpdateException: Data truncation: Data too long for column 'content' at row 1
> >         at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2028)
> >         at com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1451)
> >         at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:328)
> >
> >
> > The webpage table in MySQL looks like:
> >
> > mysql> desc webpage;
> > +-------------------+--------------+------+-----+---------+-------+
> > | Field             | Type         | Null | Key | Default | Extra |
> > +-------------------+--------------+------+-----+---------+-------+
> > | id                | varchar(512) | NO   | PRI | NULL    |       |
> > | headers           | blob         | YES  |     | NULL    |       |
> > | text              | mediumtext   | YES  |     | NULL    |       |
> > | status            | int(11)      | YES  |     | NULL    |       |
> > | markers           | blob         | YES  |     | NULL    |       |
> > | parseStatus       | blob         | YES  |     | NULL    |       |
> > | modifiedTime      | bigint(20)   | YES  |     | NULL    |       |
> > | score             | float        | YES  |     | NULL    |       |
> > | typ               | varchar(32)  | YES  |     | NULL    |       |
> > | baseUrl           | varchar(512) | YES  |     | NULL    |       |
> > | content           | mediumblob   | YES  |     | NULL    |       |
> > | title             | varchar(512) | YES  |     | NULL    |       |
> > | reprUrl           | varchar(512) | YES  |     | NULL    |       |
> > | fetchInterval     | int(11)      | YES  |     | NULL    |       |
> > | prevFetchTime     | bigint(20)   | YES  |     | NULL    |       |
> > | inlinks           | blob         | YES  |     | NULL    |       |
> > | prevSignature     | blob         | YES  |     | NULL    |       |
> > | outlinks          | blob         | YES  |     | NULL    |       |
> > | fetchTime         | bigint(20)   | YES  |     | NULL    |       |
> > | retriesSinceFetch | int(11)      | YES  |     | NULL    |       |
> > | protocolStatus    | blob         | YES  |     | NULL    |       |
> > | signature         | blob         | YES  |     | NULL    |       |
> > | metadata          | blob         | YES  |     | NULL    |       |
> > +-------------------+--------------+------+-----+---------+-------+
> > 23 rows in set (0.00 sec)
> >
> >
> >
> > And my gora-sql-mapping.xml file looks like:
> >
> > <class name="org.apache.nutch.storage.WebPage" keyClass="java.lang.String" table="webpage">
> >   <primarykey column="id" length="512"/>
> >     <field name="baseUrl" column="baseUrl" length="512"/>
> >     <field name="status" column="status"/>
> >     <field name="prevFetchTime" column="prevFetchTime"/>
> >     <field name="fetchTime" column="fetchTime"/>
> >     <field name="fetchInterval" column="fetchInterval"/>
> >     <field name="retriesSinceFetch" column="retriesSinceFetch"/>
> >     <field name="reprUrl" column="reprUrl" length="512"/>
> >     <field name="content" column="content" length="65535"/>
> >     <field name="contentType" column="typ" length="32"/>
> >     <field name="protocolStatus" column="protocolStatus"/>
> >     <field name="modifiedTime" column="modifiedTime"/>
> >
> >     <!-- parse fields                                       -->
> >     <field name="title" column="title" length="512"/>
> >     <field name="text" column="text" length="32000"/>
> >     <field name="parseStatus" column="parseStatus"/>
> >     <field name="signature" column="signature"/>
> >     <field name="prevSignature" column="prevSignature"/>
> >
> >     <!-- score fields                                       -->
> >     <field name="score" column="score"/>
> >     <field name="headers" column="headers"/>
> >     <field name="inlinks" column="inlinks"/>
> >     <field name="outlinks" column="outlinks"/>
> >     <field name="metadata" column="metadata"/>
> >     <field name="markers" column="markers"/>
> > </class>
> >
> >
> > The http.content.limit in nutch-default.xml looks like:
> >
> > <property>
> >   <name>http.content.limit</name>
> >   <value>-1</value>
> >   <description>The length limit for downloaded content using the http
> >   protocol, in bytes. If this value is nonnegative (>=0), content longer
> >   than it will be truncated; otherwise, no truncation at all. Do not
> >   confuse this setting with the file.content.limit setting.
> >   </description>
> > </property>
> >
> >
> > My my.cnf file looks like:
> >
> > [mysqld]
> > max_allowed_packet             = 200M
> >
> > character-set-server = utf8
> > collation-server=utf8_unicode_ci
> >
> >
> > I've tested that I can change the http.content.limit property to a
> > nonnegative number (65535) and the fetch job completes, but I want the
> > non-truncated content available so that I'm crawling all links on the
> > page and storing the entire contents of each document, which I can then
> > index in full in Elasticsearch. Any ideas on how I can fetch and store
> > the full content in MySQL? If the answer is "use HBase", I'll do that;
> > I'm just trying to avoid introducing another new variable as I learn
> > more about how Nutch works. With the content limit set, my crawl
> > completes, but I'm missing nearly a third of the documents that I would
> > expect because the content is being truncated.
> >
> > Thanks for any advice you can offer.
> >
> > Thanks,
> > Matt
> >
>
