Good Day, Thank you for reading,

I'm working with nutch using the org.hsqldb.jdbc.JDBCDriver connector. I'm 
coming across urls with unicode characters, which cause the jdbc connector 
to throw exceptions when inserting into columns that are not utf-encoded. With 
latin1 encoding the id column can have a length of 767 characters. Switching 
the encoding to utf8mb4 resolves the issue, but at great cost: the max length 
drops to 190 characters, because the MySQL database limits primary key/unique 
key constraints to ~767 bytes and utf8mb4 uses up to 4 bytes per character.

That being said, my question is this: what are the repercussions of removing 
the primary key constraint? Does nutch/gora rely on the constraint to prevent 
duplicates from being inserted? That seems to be the obvious strategy. If that 
is the case, should we redesign to use a hash of the url as the key and store 
the url itself in a larger data type?
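
To make the hash idea concrete, here is a rough sketch of what I have in mind. 
It assumes SHA-256 as the hash, and UrlKey / hashKey are hypothetical names of 
my own, not anything that exists in nutch or gora:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of nutch or gora.
public class UrlKey {

    // Returns the SHA-256 digest of the url as 64 lowercase hex characters.
    // The result is plain ASCII and fixed-length, whatever the url contains.
    public static String hashKey(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes format as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }
}
```

The key would then always be 64 ASCII characters, well under the 767 byte 
limit in any encoding, and the full url could go in a separate, larger column. 
Whether nutch/gora can be pointed at such a key is exactly what I am unsure 
about.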

Please assist / advise,
Thank you for your time,

Arni Sumarlidason | Web Developer, Information Technology
MDA | 820 West Diamond Ave | Gaithersburg, MD | USA
[email protected] | http://www.mdaus.com
