If it is the ID field on MySQL, let me share my experience.

I did not change that field, for a reason. The ID field is a primary key and 
has a rather small size limit due to MySQL limitations (I think it is 1000 
bytes). Even for Japanese-language sites, where I do most of my crawling, all 
of the URLs I am interested in are ASCII, so in my case there was no need.

If you do need to change it (for non-English URLs), use varchar, not char. If 
you use char with utf8 or utf8mb4, you will allocate 3 or 4 bytes respectively 
per character regardless of what the character actually needs, and you will 
find long URLs causing errors as they exceed the limit you set or the 1000 byte 
absolute limit. If you use varchar, it will allocate only the number of bytes 
needed (1 per English character, more for characters in other languages). 
That doesn't completely rule out running into a URL that takes more than 1000 
bytes, but it makes it a whole lot less likely, as the majority of URLs use 
English characters.
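
To make the byte arithmetic above concrete, here is a rough Python sketch. It 
only counts bytes; it does not talk to MySQL. The helper names, the example 
URLs, and the use of 1000 as the limit are my own assumptions for illustration 
(1000 is just the figure I quoted above, so check it against your own setup).

```python
# Rough sketch: compare the bytes a URL actually needs in UTF-8 with the
# bytes reserved when every character is given the charset maximum.
# KEY_LIMIT_BYTES is the figure quoted in this message, not a verified value.
KEY_LIMIT_BYTES = 1000


def fixed_width_bytes(text, bytes_per_char):
    """Bytes reserved if every character takes bytes_per_char bytes."""
    return len(text) * bytes_per_char


def utf8_bytes(text):
    """Bytes actually needed to encode the text as UTF-8."""
    return len(text.encode("utf-8"))


# Hypothetical URLs, 319 characters each.
ascii_url = "http://example.com/" + "a" * 300
japanese_url = "http://example.com/" + "日本語" * 100

for url in (ascii_url, japanese_url):
    needed = utf8_bytes(url)
    print(
        "chars:", len(url),
        "| utf-8 bytes needed:", needed,
        "| at 3 bytes/char:", fixed_width_bytes(url, 3),
        "| at 4 bytes/char:", fixed_width_bytes(url, 4),
        "| fits in limit:", needed <= KEY_LIMIT_BYTES,
    )
```

The ASCII URL needs one byte per character, so it stays well under the limit, 
while the same URL counted at 4 bytes per character would already be over it. 
The Japanese URL needs 3 bytes for each non-ASCII character and gets close to 
the limit at the same length.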

I will look at changing it in my example but it may be a bit later as I think 
it needs some testing to make sure I don't cause problems. 

If you are crawling websites with UR
-----Original Message-----
From: sumarlidason [mailto:[email protected]] 
Sent: Thursday, October 25, 2012 12:48 AM
To: [email protected]
Subject: RE: nutch/hadoop/solr

err, manager pointed out that its the ID field complaining now.. so attempting 
to change the collation there as well.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/nutch-hadoop-solr-tp4014761p4015626.html
Sent from the Nutch - User mailing list archive at Nabble.com.
