Alright, I need some help with MySQL and its ridiculously over complicated character set configuration. Explain here, give me links, whatever. I've read all of the documentation of it on the mysql website. It is sorely lacking.
From the command line client, my character set settings look like so:
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
Everything is happy when using the client like this. I've got some crazy arabic text in the db from some data imported from wikipedia. It appears properly as far as I can tell - I would past here, but it goes all crazy. Unfortunately, when I retrieve the results using java and mysql connector/j or with phpMyAdmin, those same arabic characters that are perfectly fine in the db come out as garbage. Lots of reading and testing leads me to believe this is probably directly related to that character_set_results setting.
If I change character_set_results in mysql client to utf8, then it comes out as garbage, just like I am getting with java and phpMyAdmin. I have tried setting the character_set_results to latin1 when using java, but I still get garbage, which probably means I didn't do it right, but again, the documentation blows so all I can really do is guess and copy what people on forums claim to have done.
So, how do I set the character_set_results with connector/j? I want to set a default for it, not have to issue a specific "SET character_set_results='latin1'" as a query before I do anything.
What the fuck is this character_set_results anyway? It's been made pretty clear that it is the character set that the client is expecting the results back in from mysql. Why does changing it to utf8 cause my shit to be all garbled even when everything on both server and client are set for utf8? I guess mysql is sending the results back encoded in latin1 anyway? Then what is the point of this setting? What other setting do I also need to change? How do I change them for my connector/j settings? php?
Posts
The issue was something related to having utf-8 encoded characters going to a utf-8 column, but several of those settings are set to latin-1 as you can see. I'm guessing character_set_connection was the trouble maker here. It caused mysql to think it was getting latin1 encoded data, when really it was utf8. So then it tried to convert it from latin1 to utf8 when it was inserted. Then when the data was being retrieved, it got converted around all ass backwards again.
So, the lesson here, is that if you're going to insert data that is encoded in utf-8, make sure character_set_connection is utf8, possibly some of the others, too, but I don't know yet.
I'm really surprised there's not more documentation on this. MySQL's website has a quick mention of character_set_results, and a couple pages explaining the heirarchy of the character sets from db to table to column. I've worked with MySQL daily for 1 1/2 years now, plus on and off work with it before that, and have never come across a mention of those variables and their effects depending on how your data is actually encoded until I had to start searching to resolve this issue. Nearly all of my work has been in Perl which appears to be smart enough to just sort out the encodings even with bad settings, so I'd never seen this problem before.