I am giving Nutch 2.0 MySQL a go also. I am running into similar issues too.

I changed inlinks on both host and webpage tables to longblob. I also
modified the webpage.title to because 512 was not cutting it. I made couple
of code changes also. I noticed that my updatedb jobs kept on failing
because one of the page had a really long id. It turns out that id
contained body text also. To handle this, I skipped pages with id longer
than 512 in DbUpdateReducer.reduce. Every page skipped had a malformed id.

My current schema is posted below and its running pretty stable for me. I
have only gone through few hundred thousand pages, but I seem to have came
across all the funny ones.

Cheers,
Paul

----- my current nutch 2.0 mysql schema
CREATE TABLE `host` (
  `id` varchar(767) CHARACTER SET latin1 NOT NULL DEFAULT '',
  `inlinks` longblob,
  `outlinks` longblob,
  `metadata` longblob,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `webpage` (
  `id` varchar(767) CHARACTER SET latin1 NOT NULL DEFAULT '',
  `headers` blob,
  `text` mediumtext,
  `status` int(11) DEFAULT NULL,
  `markers` blob,
  `parseStatus` blob,
  `modifiedTime` bigint(20) DEFAULT NULL,
  `score` float DEFAULT NULL,
  `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
  `baseUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL,
  `content` mediumblob,
  `title` varchar(2048) DEFAULT NULL,
  `reprUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL,
  `fetchInterval` int(11) DEFAULT NULL,
  `prevFetchTime` bigint(20) DEFAULT NULL,
  `inlinks` longblob,
  `prevSignature` blob,
  `outlinks` blob,
  `fetchTime` bigint(20) DEFAULT NULL,
  `retriesSinceFetch` int(11) DEFAULT NULL,
  `protocolStatus` blob,
  `signature` blob,
  `metadata` blob,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

On Sat, Sep 1, 2012 at 4:25 PM, <[email protected]> wrote:

> I am using `content` mediumblob in the SQL table and have not had any
> issues so far but that means little as it is probably due to differences
> in what we are crawling.  However, I am running into a similar problem
> with the outlinks column so I suspect the original webpage table
> creation script for MySQL has never been tested with real production
> runs.  I am going to play around with that and see if I can get
> something more robust this week.
>
> -----Original Message-----
> From: Matt MacDonald [mailto:[email protected]]
> Sent: Friday, August 31, 2012 9:38 AM
> To: [email protected]
> Subject: Re: Nutch 2.0 MySQL Data truncation: Data too long for column
> 'content' at row 1
>
> Yeah I had tried using "MEDIUMBLOB" and ran into the same error message.
>
> <field name="content" column="content" jdbc-type="MEDIUMBLOB"/>
>
> It seems likely that we'd want to use HBase in production anyway so I
> ended up switching the configuration over to HBase for the crawl and
> didn't run into any issues with the content length of the documents. If
> anyone knows more about what the mysql issue it might be worth replying
> so that others that are encountering the issue don't hit the same dead
> end that I did.
>
> Thanks,
> Matt
>
> On Thu, Aug 30, 2012 at 4:05 PM, Ferdy Galema
> <[email protected]>wrote:
>
> > Hi,
> >
> > It might be too obvious but have you already tried to play around with
>
> > the following line in the mapping (seems like a limit to me)?
> > <field name="content" column="content" length="65535"/>
> >
> > It has been a while since I tried to store big content with the
> > SqlStore, so I'm not sure how it works exactly.
> >
> > Ferdy.
> >
> > On Thu, Aug 30, 2012 at 2:20 PM, Matt MacDonald <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > I'm using Nutch 2.0 from the 2.x branch on Github and used
> > > http://nlp.solutions.asia/?p=180 to configure Nutch to use MySQL as
> > > the storage backend. I'm seeing the following error show up in my
> > > hadoop.log file while fetching and wonder if others have ideas for
> > > moving past the error without having to set the http.content.limit:
> > >
> > > java.io.IOException: java.sql.BatchUpdateException: Data truncation:
>
> > > Data too long for column 'content' at row 1
> > >         at
> org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:340)
> > >         at
> org.apache.gora.sql.store.SqlStore.close(SqlStore.java:185)
> > >         at
> > >
> > org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java
> > :55)
> > >         at
> > >
> > >
> > org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(Redu
> > ceTask.java:579)
> > >         at
> > >
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:650)
> > >         at
> org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
> > >         at
> > > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:
> > > 260) Caused by: java.sql.BatchUpdateException: Data truncation: Data
>
> > > too long for column 'content' at row 1
> > >         at
> > >
> > >
> > com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatemen
> > t.java:2028)
> > >         at
> > >
> > com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1
> > 451)
> > >         at
> > > org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:328)
> > >
> > >
> > > The webpage table in MySQL looks like:
> > >
> > > mysql> desc webpage;
> > > +-------------------+--------------+------+-----+---------+-------+
> > > | Field             | Type         | Null | Key | Default | Extra |
> > > +-------------------+--------------+------+-----+---------+-------+
> > > | id                | varchar(512) | NO   | PRI | NULL    |       |
> > > | headers           | blob         | YES  |     | NULL    |       |
> > > | text              | mediumtext   | YES  |     | NULL    |       |
> > > | status            | int(11)      | YES  |     | NULL    |       |
> > > | markers           | blob         | YES  |     | NULL    |       |
> > > | parseStatus       | blob         | YES  |     | NULL    |       |
> > > | modifiedTime      | bigint(20)   | YES  |     | NULL    |       |
> > > | score             | float        | YES  |     | NULL    |       |
> > > | typ               | varchar(32)  | YES  |     | NULL    |       |
> > > | baseUrl           | varchar(512) | YES  |     | NULL    |       |
> > > | content           | mediumblob   | YES  |     | NULL    |       |
> > > | title             | varchar(512) | YES  |     | NULL    |       |
> > > | reprUrl           | varchar(512) | YES  |     | NULL    |       |
> > > | fetchInterval     | int(11)      | YES  |     | NULL    |       |
> > > | prevFetchTime     | bigint(20)   | YES  |     | NULL    |       |
> > > | inlinks           | blob         | YES  |     | NULL    |       |
> > > | prevSignature     | blob         | YES  |     | NULL    |       |
> > > | outlinks          | blob         | YES  |     | NULL    |       |
> > > | fetchTime         | bigint(20)   | YES  |     | NULL    |       |
> > > | retriesSinceFetch | int(11)      | YES  |     | NULL    |       |
> > > | protocolStatus    | blob         | YES  |     | NULL    |       |
> > > | signature         | blob         | YES  |     | NULL    |       |
> > > | metadata          | blob         | YES  |     | NULL    |       |
> > > +-------------------+--------------+------+-----+---------+-------+
> > > 23 rows in set (0.00 sec)
> > >
> > >
> > >
> > > And my gora-sql-mapping.xml file looks like:
> > >
> > > <class name="org.apache.nutch.storage.WebPage"
> > keyClass="java.lang.String"
> > > table="webpage">
> > >   <primarykey column="id" length="512"/>
> > >     <field name="baseUrl" column="baseUrl" length="512"/>
> > >     <field name="status" column="status"/>
> > >     <field name="prevFetchTime" column="prevFetchTime"/>
> > >     <field name="fetchTime" column="fetchTime"/>
> > >     <field name="fetchInterval" column="fetchInterval"/>
> > >     <field name="retriesSinceFetch" column="retriesSinceFetch"/>
> > >     <field name="reprUrl" column="reprUrl" length="512"/>
> > >     <field name="content" column="content" length="65535"/>
> > >     <field name="contentType" column="typ" length="32"/>
> > >     <field name="protocolStatus" column="protocolStatus"/>
> > >     <field name="modifiedTime" column="modifiedTime"/>
> > >
> > >     <!-- parse fields                                       -->
> > >     <field name="title" column="title" length="512"/>
> > >     <field name="text" column="text" length="32000"/>
> > >     <field name="parseStatus" column="parseStatus"/>
> > >     <field name="signature" column="signature"/>
> > >     <field name="prevSignature" column="prevSignature"/>
> > >
> > >     <!-- score fields                                       -->
> > >     <field name="score" column="score"/>
> > >     <field name="headers" column="headers"/>
> > >     <field name="inlinks" column="inlinks"/>
> > >     <field name="outlinks" column="outlinks"/>
> > >     <field name="metadata" column="metadata"/>
> > >     <field name="markers" column="markers"/> </class>
> > >
> > >
> > > The http.content.limit in nutch-default.xml looks like:
> > >
> > > <property>
> > >   <name>http.content.limit</name>
> > >   <value>-1</value>
> > >   <description>The length limit for downloaded content using the
> http
> > >   protocol, in bytes. If this value is nonnegative (>=0), content
> longer
> > >   than it will be truncated; otherwise, no truncation at all. Do not
> > >   confuse this setting with the file.content.limit setting.
> > >   </description>
> > > </property>
> > >
> > >
> > > My my.cnf file looks like:
> > >
> > > [mysqld]
> > > max_allowed_packet             = 200M
> > >
> > > character-set-server = utf8
> > > collation-server=utf8_unicode_ci
> > >
> > >
> > > I've tested that I could change the http.content.limit property to
> be a
> > > nonnegative number (65535) and the fetch job completes, but I want
> to
> > have
> > > the non-truncated content available so that I'm crawling all links
> on the
> > > page and storing the entire contents of the document so that I can
> then
> > > index the entire text in Elasticsearch. Any ideas on how I can fetch
> and
> > > store the full content in MySQL? If the answer is - use HBase I'll
> do
> > that
> > > I'm just trying to remove another new variable as I learn more about
> how
> > > Nutch works. With the content.limit set my crawl completes but I'm
> > missing
> > > nearly a 3rd of the documents that I would expect because the
> content is
> > > being truncated?
> > >
> > > Thanks for any advice you can offer.
> > >
> > > Thanks,
> > > Matt
> > >
> >
>

Reply via email to