Encoding UTF8 string to latin1/iso-8859-1 with Go and MySQL

Issue

I have a MySQL database with German umlauts ("öäü") in a table, and I need to write a Go app that reads the table, encodes the text to ISO-8859-1, and writes it to a text file.

So far so good, but the encoding to ISO-8859-1 is not working, and I have tried to debug this.

Here are some details and information:

MySQL
The MySQL database is UTF-8, as is the table itself. The other character sets should be fine too, except character_set_server, but I think that is not relevant here; as far as I know, it is just a default for newly created databases.

When I query the database with the following SQL, I get the correctly UTF-8-encoded text:

select street, hex(street) from test_table where id = '36'

Result (the real value is "Fröbelstraße"):
Fröbelstraße, 4672C3B662656C73747261C39F65

So the hex string is exactly what I expected. OK.

Go App
Just the relevant parts…

db, err := sql.Open("mysql", "...<connection string>...")
res, err := db.Query("select street from test_table where id = '36'")

for res.Next() {
        var pb Phonebook
        err := res.Scan(&pb.Street)
        fmt.Println(hex.EncodeToString([]byte(pb.Street)))
}

The output is
4672c383c2b662656c73747261c383c5b865

And that’s the problem: my encoding to ISO-8859-1 is not working because the string from the database is not correct. The hex from the direct database query is correct and also works with the encoding.

But I don’t understand why I get a different string from the Go client.

The original string "Fröbelstraße" contains two special characters: "ö", which is C3B6, and "ß", which is C39F. The hex from the query with a DB client is OK, but the one from the Go app is too long, because I get two extra bytes per special character.

When I feed my latin1 converter with the correct hex string, it works fine and I get an ISO-8859-1 string. But not with the one I query directly from Go.

I do this with:

    d := charmap.ISO8859_1.NewEncoder()
    out, err := d.String(inp)

This is also just a snippet; I actually call a function with a string, but I never get the correct ISO-8859-1 result. So I tried it with the hex from the MySQL client query, converted back to a string, and bingo, the ISO-8859-1 result is correct.

I also tried querying from Python and get the same strange hex there, so I am completely lost as to what is wrong here. It cannot be Go, because it is the same in Python. But in my opinion the data is stored correctly in the database, and the MySQL character sets are all set to utf8mb4 or utf8, except the one I mentioned above.

Solution

4672c383c2b662656c73747261c383c5b865 appears to be "double-encoded". Look for that in "Trouble with UTF-8 characters; what I see is not what I stored".

However, since you were getting the hex from the app, not from the table, it is inconclusive. Please do SELECT HEX(col) FROM ... to see if you get exactly that string. Often apps, especially browsers, try to "fix" the problem, thereby making it harder to diagnose correctly.

If you have "CHARACTER SET utf8mb4 with double-encoding", then this may cure the data:

UPDATE tbl SET col = CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8mb4);

However, you need to fix the source of the data, too.
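If fixing the rows in place with SQL is not an option, the double encoding can also be reversed on the Go side after reading: map each rune of the mojibake string back to the single latin1/cp1252 byte it was misread from, and the resulting byte sequence is the original UTF-8. A hedged sketch (it only special-cases the one cp1252 code point that appears in this example; a complete fix would use the full cp1252 table, e.g. charmap.Windows1252.NewEncoder() from golang.org/x/text):

```go
package main

import "fmt"

// undoDoubleEncoding reverses one round of latin1/cp1252 double-encoding:
// each rune is mapped back to the byte it was misread from. Runes outside
// the handled ranges are dropped; a full implementation would need the
// complete cp1252 mapping.
func undoDoubleEncoding(s string) string {
	out := make([]byte, 0, len(s))
	for _, r := range s {
		switch {
		case r == '\u0178': // Ÿ was misread from byte 0x9F
			out = append(out, 0x9F)
		case r <= 0xFF: // latin1 range: code point == byte value
			out = append(out, byte(r))
		}
	}
	return string(out)
}

func main() {
	fmt.Println(undoDoubleEncoding("FrÃ¶belstraÃŸe")) // Fröbelstraße
}
```

This is only a band-aid for already-corrupted reads; as the answer says, the real fix is to repair the stored data and the insertion path that double-encoded it.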

Answered By – Rick James

Answer Checked By – David Marino (GoLangFix Volunteer)
