Read UTF-8 encoded string from io.Reader

Issue

I am writing a small communication protocol with TCP sockets.
I am able to read and write basic data types such as integers, but I have no idea how to read a UTF-8 encoded string from a slice of bytes.

The protocol client is written in Java and the server is Go.

From what I have read, Go runes are 32 bits long and UTF-8 characters are 1 to 4 bytes long, which makes it impossible to simply cast a byte slice to a string.

I’d like to know how I can read and write this UTF-8 stream.

Note
I know the length of the byte buffer at the time I read the string.

Solution

Some theory first:

  • A rune in Go represents a Unicode code point, a number assigned to a particular character in Unicode. It’s an alias for int32.
  • UTF-8 is a Unicode encoding, a format for representing Unicode code points for storage and transmission. UTF-8 uses 1 to 4 bytes to encode a single code point.
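
To make the difference between a code point and its encoding concrete, here is a minimal sketch that prints how many bytes UTF-8 needs for a few runes. The helper name encodedLen is made up for this illustration; utf8.RuneLen comes from the standard unicode/utf8 package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen is a hypothetical helper: it reports how many bytes
// UTF-8 needs to encode the code point r.
func encodedLen(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	// U+0061, U+00E9, U+20AC and U+1F600: one code point each,
	// but a different number of bytes once encoded.
	for _, r := range []rune{'a', '\u00e9', '\u20ac', '\U0001F600'} {
		fmt.Printf("U+%04X -> %d byte(s)\n", r, encodedLen(r))
	}
}
```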

How this maps on Go data types:

  • Both []byte and string store a series of bytes (a byte in Go is an alias for uint8).

    The chief difference is that strings are immutable, so while you can

      b := make([]byte, 2)
      b[0] = byte('a')
      b[1] = byte('z')
    

    you can’t

      var s string
      s[0] = byte('a')
    

    The latter fact is underlined by the inability to set the string length explicitly (as in an imaginary s := make(string, 10)).

  • While strings in Go contain arbitrary bytes (you’re free to store in them, say, characters encoded using Windows-1252), certain Go statements and type conversions interpret strings as being encoded in UTF-8, in particular:

    • A type conversion from string to []rune parses the string as a sequence of UTF-8-encoded code points and produces a slice of them. The reverse type conversion takes the Unicode code points from the slice of runes and produces a UTF-8-encoded string.
    • A range loop over a string loops through Unicode code points comprising the string, not just bytes.
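
Both behaviours can be seen in a short sketch. The helper names runeCount and byteOffsets are invented for this example; the sample string is written with a \u escape so the byte counts do not depend on how the source file was saved.

```go
package main

import "fmt"

// runeCount converts s to []rune, which decodes it as UTF-8,
// and returns the number of code points found.
func runeCount(s string) int {
	return len([]rune(s))
}

// byteOffsets returns the byte index at which each code point of s
// starts, as reported by a range loop over the string.
func byteOffsets(s string) []int {
	var offs []int
	for i := range s {
		offs = append(offs, i)
	}
	return offs
}

func main() {
	s := "h\u00e9llo" // 5 code points, the second one takes 2 bytes

	fmt.Println(len(s), runeCount(s)) // bytes vs. code points
	fmt.Println(byteOffsets(s))       // range skips over the second byte of U+00E9

	// The reverse conversion encodes the runes back to UTF-8.
	fmt.Println(string([]rune{'h', '\u00e9'}) == "h\u00e9")
}
```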

Go also supplies the type conversions between string and []byte and back. Now recall that strings are read-only, while slices of bytes are not. This means a construct like

    b := make([]byte, 1000)
    if _, err := io.ReadFull(r, b); err != nil {
        // handle the error: fewer than len(b) bytes were read
    }
    s := string(b)

always copies the data, regardless of whether you convert a slice to a string or the other way around. This costs memory but is type-safe and preserves the semantics.
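
A small sketch shows what the copy means in practice: overwriting the slice after the conversion leaves the string untouched. The function name convertThenMutate is made up for this illustration.

```go
package main

import "fmt"

// convertThenMutate converts b to a string and then overwrites the
// first byte of b. The returned string keeps the old contents because
// the conversion gave it its own copy of the bytes.
func convertThenMutate(b []byte) string {
	s := string(b)
	if len(b) > 0 {
		b[0] = 'X'
	}
	return s
}

func main() {
	b := []byte("abc")
	s := convertThenMutate(b)
	fmt.Println(s, string(b))
}
```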

Now back to your task at hand.

If you work with reasonably small strings and are not under memory pressure, just convert the byte slices filled by io.ReadFull() (or by a Read call on your io.Reader) to strings. Be sure to reuse the slice you read the data into to ease the pressure on the garbage collector; that is, do not allocate a new slice for each read, since the data placed in it by the reading code gets copied off to a string anyway.

Finally, if you absolutely must avoid copying the data (say, you’re dealing with multi-megabyte strings and have tight memory requirements), you may try to play dirty tricks by unsafely working with memory, transplanting the memory from a byte slice to a string. Note that should you resort to something like this, you must understand very well that it’s free to break with any new release of Go, and it’s not even guaranteed to work at all.
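
One possible form of such a trick, assuming Go 1.20 or newer, uses unsafe.String and unsafe.SliceData from the standard unsafe package. This is only a sketch under that assumption, and the helper name bytesToString is made up. The resulting string shares memory with the slice, so the slice must never be written to again (which also means it cannot be reused as a read buffer).

```go
package main

import (
	"fmt"
	"unsafe"
)

// bytesToString (a hypothetical helper) returns a string that shares
// b's backing array instead of copying it. The caller must not modify
// b afterwards, otherwise the supposedly immutable string changes too.
// Assumes Go 1.20 or newer.
func bytesToString(b []byte) string {
	return unsafe.String(unsafe.SliceData(b), len(b))
}

func main() {
	b := []byte("imagine a multi-megabyte payload here")
	s := bytesToString(b)
	fmt.Println(s)
}
```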

Answered By – kostix

Answer Checked By – Candace Johnson (GoLangFix Volunteer)
