Intro to Golang: Runes, Bytes, and Strings

Golang's a bit out of char-acter in this one

Weekly Spotlight

Introduction

This week's newsletter is dedicated to strings, runes, and bytes in Go. This might seem like a simple topic, but a closer look at these three building blocks shows how Go actually represents and processes text.

If you aren’t already taking weekly programming deep dives with me, subscribe below!

What is a string?

In Go, a string is essentially a read-only slice of bytes. It can hold arbitrary bytes and is not confined to any predefined format, such as Unicode text or UTF-8 text.
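To make the "read-only slice of bytes" idea concrete, here is a minimal sketch. The helper name replaceFirstByte is made up for illustration; the point is that len counts bytes, indexing yields a byte, and any edit has to happen on a []byte copy.

```go
package main

import "fmt"

// replaceFirstByte is a hypothetical helper: it returns a copy of s with its
// first byte replaced by b. Strings are immutable, so the edit is made on a
// []byte copy and converted back.
func replaceFirstByte(s string, b byte) string {
	if len(s) == 0 {
		return s
	}
	buf := []byte(s) // copies the string's bytes
	buf[0] = b
	return string(buf)
}

func main() {
	s := "héllo"
	fmt.Println(len(s)) // 6: 'é' takes two bytes in UTF-8
	fmt.Println(s[1])   // 195 (0xc3), the first byte of 'é'

	// s[0] = 'H' // does not compile: strings cannot be modified in place

	fmt.Println(replaceFirstByte(s, 'H')) // Héllo
	fmt.Println(s)                        // héllo (the original is unchanged)
}
```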

Here's a string literal:

const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

Now, if we print this string directly, we get this output:

fmt.Println(sample)
// Output: ��=� ⌘

This output is hardly readable because some of the bytes in our sample string are not valid ASCII or even valid UTF-8. To understand what this string really holds, we need to dissect it and examine the pieces.

for i := 0; i < len(sample); i++ {
    fmt.Printf("%x ", sample[i])
}
// Output: bd b2 3d bc 20 e2 8c 98

In the output, notice how the individual bytes match the hexadecimal escapes that defined the string.

Printing strings

There are a few tricks to print strings in a more presentable way. The %x format verb of fmt.Printf outputs the sequential bytes of the string as hexadecimal digits:

fmt.Printf("%x\n", sample)
// Output: bdb23dbc20e28c98

Adding a space between the % and the x (`% x`) gives us the bytes with spaces between:

fmt.Printf("% x\n", sample)
// Output: bd b2 3d bc 20 e2 8c 98

The %q (quoted) verb escapes any non-printable byte sequences, making the output unambiguous:

fmt.Printf("%q\n", sample)
// Output: "\xbd\xb2=\xbc ⌘"

The “plus” flag with %q (`%+q`) exposes the Unicode values of properly formatted UTF-8 non-ASCII data in the string:

fmt.Printf("%+q\n", sample)
// Output: "\xbd\xb2=\xbc \u2318"
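The same verbs work with fmt.Sprintf, which is handy when the formatted text should go into a log line or an error message instead of straight to stdout. The sketch below assumes a hypothetical helper named describe and a short input of my own choosing that mixes one invalid byte with a valid two-byte character.

```go
package main

import "fmt"

// describe is a hypothetical helper that renders s with each of the four
// verbs discussed above, in order: %x, % x, %q, %+q.
func describe(s string) [4]string {
	return [4]string{
		fmt.Sprintf("%x", s),
		fmt.Sprintf("% x", s),
		fmt.Sprintf("%q", s),
		fmt.Sprintf("%+q", s),
	}
}

func main() {
	// 'a', one invalid byte (0xff), then 'é' (bytes c3 a9).
	for _, line := range describe("a\xffé") {
		fmt.Println(line)
	}
	// Output:
	// 61ffc3a9
	// 61 ff c3 a9
	// "a\xffé"
	// "a\xff\u00e9"
}
```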

UTF-8 and string literals

When we store a character value in a string, we store its byte-at-a-time representation. For instance, the Unicode character ⌘ is represented by the bytes e2 8c 98, which are the UTF-8 encoding of the code point U+2318 (hexadecimal value 2318).

It's important to note that Go source code is defined to be UTF-8 text. When a symbol such as ⌘ is typed into a string literal, the text editor stores the UTF-8 encoding of that symbol in the source file. That's why a regular string literal always contains valid UTF-8, unless it includes byte-level escapes (like \xbd) that break the encoding.
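As a quick check of this, the sketch below spells the same symbol three ways: typed directly, as a \u code point escape, and as raw \x byte escapes. The function name spellings is just an illustrative choice.

```go
package main

import "fmt"

// spellings is a hypothetical helper returning three ways of writing the
// same one-character string in Go source.
func spellings() (typed, codePoint, rawBytes string) {
	typed = "⌘"                // UTF-8 bytes stored in the source file by the editor
	codePoint = "\u2318"       // code point escape, encoded as UTF-8 by the compiler
	rawBytes = "\xe2\x8c\x98"  // the three bytes written out by hand
	return
}

func main() {
	a, b, c := spellings()
	fmt.Printf("% x\n", a)        // e2 8c 98
	fmt.Println(a == b, b == c)   // true true
	fmt.Println(len(a))           // 3
}
```

All three literals compile to the same three bytes, so comparing them with == reports them as equal.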

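Many people believe Go strings are always UTF-8, but they are not. A string value can hold arbitrary bytes; only string literals written without byte-level escapes are guaranteed to be valid UTF-8. In practice, strings built from such literals are UTF-8, while strings built from external data (files, network input, byte slices) may not be.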
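Since a string may or may not be valid UTF-8, it can be worth checking bytes that come from outside the program. Here is a minimal sketch using the standard unicode/utf8 package; toText is a hypothetical helper name, not something from the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// toText is a hypothetical helper: it converts b to a string only when the
// bytes are valid UTF-8, and reports whether the conversion was accepted.
func toText(b []byte) (string, bool) {
	if !utf8.Valid(b) {
		return "", false
	}
	return string(b), true
}

func main() {
	inputs := [][]byte{
		[]byte("日本語"),        // valid UTF-8
		{0xbd, 0xb2, 0x3d},   // the first three bytes of the sample string
	}
	for _, in := range inputs {
		if s, ok := toText(in); ok {
			fmt.Printf("valid: %s\n", s)
		} else {
			fmt.Printf("invalid UTF-8: % x\n", in)
		}
	}
	// Output:
	// valid: 日本語
	// invalid UTF-8: bd b2 3d
}
```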

Code points, characters, and runes

The Unicode standard refers to the item represented by a single value as a “code point”. For example, the Unicode code point U+0061 is the lower-case Latin letter 'a'.

In Go, we use the term "rune" instead of "code point". A rune in Go is an alias for the type int32, allowing programs to clearly indicate when an integer value represents a code point.
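A short sketch of what the alias means in practice: a rune can be assigned to an int32 without a conversion, and counting runes gives a different answer than counting bytes. The helper runeCount is an illustrative wrapper around utf8.RuneCountInString.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeCount is an illustrative wrapper that returns the number of code
// points in s (as opposed to len, which returns the number of bytes).
func runeCount(s string) int {
	return utf8.RuneCountInString(s)
}

func main() {
	var r rune = 'a' // a rune literal
	var i int32 = r  // no conversion needed: rune is an alias for int32
	fmt.Printf("%c %d %U\n", r, i, r) // a 97 U+0061

	s := "⌘ key"
	fmt.Println(len(s), runeCount(s)) // 7 5

	rs := []rune(s) // decodes the UTF-8 bytes into code points
	fmt.Printf("%U\n", rs[0]) // U+2318
}
```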

Range loops

Aside from source code being UTF-8, the main place where the language itself treats UTF-8 specially is a for range loop over a string. Each iteration decodes one UTF-8-encoded rune and yields the byte index where that rune starts, as shown in the example below:

const nihongo = "日本語"
for index, runeValue := range nihongo {
    fmt.Printf("%#U starts at byte position %d\n", runeValue, index)
}
// Output:
// U+65E5 '日' starts at byte position 0
// U+672C '本' starts at byte position 3
// U+8A9E '語' starts at byte position 6

The output shows that each of these code points occupies three bytes, which is why the index advances by 3 on every iteration rather than by 1.
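One more detail worth knowing is what a range loop does with bytes that are not valid UTF-8. The sketch below collects the start offsets with a hypothetical helper named runeStarts, then ranges over a string containing a stray 0xff byte: the loop yields the replacement character U+FFFD and moves forward by a single byte.

```go
package main

import "fmt"

// runeStarts is a hypothetical helper that returns the byte offset at which
// each rune decoded by a range loop begins.
func runeStarts(s string) []int {
	var starts []int
	for i := range s {
		starts = append(starts, i)
	}
	return starts
}

func main() {
	fmt.Println(runeStarts("日本語")) // [0 3 6]

	// An invalid byte decodes as U+FFFD and consumes exactly one byte.
	for i, r := range "a\xffb" {
		fmt.Printf("%d:%U ", i, r)
	}
	fmt.Println()
	// 0:U+0061 1:U+FFFD 2:U+0062
}
```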

[ Zach Coriarty ]
