Intro to Golang: Runes, Bytes, and Strings
Golang's a bit out of char-acter in this one
Weekly Spotlight
Introduction
This week's newsletter is dedicated to strings, runes, and bytes in Go. It might seem like a simple topic, but understanding how these three relate explains how Go stores, indexes, and iterates over text.
What is a string?
In Go, a string is essentially a read-only slice of bytes. It can hold arbitrary bytes and is not confined to any predefined format, such as Unicode text or UTF-8 text.
Here's a string literal:
const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"
Now, if we print this string directly, we get this output:
fmt.Println(sample)
// Output: ��=� ⌘
This output is hardly readable because some of the bytes in our sample string are not valid ASCII or even valid UTF-8. To understand what this string really holds, we need to dissect it and examine the pieces.
for i := 0; i < len(sample); i++ {
	fmt.Printf("%x ", sample[i])
}
// Output: bd b2 3d bc 20 e2 8c 98
In the output, notice how the individual bytes match the hexadecimal escapes that defined the string.
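The "read-only slice of bytes" description can be checked directly. The following is a minimal sketch, not part of the original example: the helper name `withFirstByte` is hypothetical and only illustrates that `len` counts bytes, that indexing yields a byte, and that changing a string means working on a `[]byte` copy.

```go
package main

import "fmt"

// sample is the string literal used in the text above.
const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

// withFirstByte (hypothetical helper) returns a copy of s whose first byte
// is replaced by b. Strings are read-only, so the change is made on a
// []byte copy and the original string is left untouched.
func withFirstByte(s string, b byte) string {
	if len(s) == 0 {
		return s
	}
	buf := []byte(s) // copies the bytes of s
	buf[0] = b
	return string(buf)
}

func main() {
	fmt.Println(len(sample))      // 8: len counts bytes, not characters
	fmt.Printf("%T\n", sample[0]) // uint8: indexing yields a single byte

	original := "hello"
	changed := withFirstByte(original, 'j')
	fmt.Println(original, changed) // hello jello
}
```

Writing `sample[0] = 'x'` directly would not compile, which is what "read-only" means in practice.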
Printing strings
There are a few tricks to print strings in a more presentable way. The `%x` format verb of `fmt.Printf` outputs the sequential bytes of the string as hexadecimal digits:
fmt.Printf("%x\n", sample)
// Output: bdb23dbc20e28c98
Adding a space between the `%` and the `x` (`% x`) gives us the bytes with spaces between:
fmt.Printf("% x\n", sample)
// Output: bd b2 3d bc 20 e2 8c 98
The `%q` (quoted) verb escapes any non-printable byte sequences, making the output unambiguous:
fmt.Printf("%q\n", sample)
// Output: "\xbd\xb2=\xbc ⌘"
The “plus” flag with `%q` (`%+q`) exposes the Unicode values of properly formatted UTF-8 non-ASCII data in the string:
fmt.Printf("%+q\n", sample)
// Output: "\xbd\xb2=\xbc \u2318"
UTF-8 and string literals
When we store a character value in a string, we store its byte-at-a-time representation. For instance, the Unicode character ⌘ is represented by the bytes `e2 8c 98`, which are the UTF-8 encoding of the hexadecimal value `2318`.
It's important to note that Go source code is defined to be UTF-8 text, which means that when a string literal is written in the source code, the text editor places the UTF-8 encoding of the symbol into the source text. That's why, unless it contains UTF-8-breaking escapes, a regular string literal will also always contain valid UTF-8.
Many people believe Go strings are always UTF-8, but they are not: only string literals written without byte-level escapes are guaranteed to be valid UTF-8. A string can hold arbitrary bytes; when it is built from string literals, those bytes are almost always UTF-8.
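One way to check which kind of string you are holding is `utf8.ValidString` from the standard library. The sketch below is illustrative only; the helper name `validity` and the sample values are assumptions, not part of the original post.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// validity (hypothetical helper) describes whether s is well-formed UTF-8.
func validity(s string) string {
	if utf8.ValidString(s) {
		return "valid UTF-8"
	}
	return "not valid UTF-8"
}

func main() {
	// A plain literal typed into UTF-8 source code.
	fmt.Println(validity("⌘ and 日本語")) // valid UTF-8

	// A literal with byte-level escapes that break UTF-8.
	fmt.Println(validity("\xbd\xb2")) // not valid UTF-8

	// A string built from arbitrary bytes at run time.
	raw := string([]byte{0xff, 0xfe, 0x41})
	fmt.Println(validity(raw)) // not valid UTF-8
}
```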
Code points, characters, and runes
The Unicode standard refers to the item represented by a single value as a “code point”. For example, the Unicode code point `U+0061` is the lower-case Latin letter 'a'.
In Go, we use the term "rune" instead of "code point". A rune in Go is an alias for the type `int32`, allowing programs to clearly indicate when an integer value represents a code point.
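A short sketch makes the alias visible. The helper name `codePoints` is hypothetical; it only wraps the built-in `[]rune` conversion, which decodes a string into its code points.

```go
package main

import "fmt"

// codePoints (hypothetical helper) converts s into its sequence of runes.
func codePoints(s string) []rune {
	return []rune(s)
}

func main() {
	var r rune = 'a'
	fmt.Printf("%T %d %U\n", r, r, r) // int32 97 U+0061

	// No conversion is needed: rune and int32 are the same type.
	var i int32 = r
	fmt.Println(i == 0x61) // true

	// Byte length versus number of code points.
	s := "日本語"
	fmt.Println(len(s), len(codePoints(s))) // 9 3
}
```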
Range loops
The most visible place where Go treats UTF-8 specially is a `for range` loop on a string. A `for range` loop decodes one UTF-8-encoded rune on each iteration, as shown in the example below:
const nihongo = "日本語"
for index, runeValue := range nihongo {
	fmt.Printf("%#U starts at byte position %d\n", runeValue, index)
}
// Output:
// U+65E5 '日' starts at byte position 0
// U+672C '本' starts at byte position 3
// U+8A9E '語' starts at byte position 6
The byte positions advance by three on each iteration because each of these code points occupies three bytes in UTF-8.
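The same walk can be written by hand with `utf8.DecodeRuneInString`, which is a reasonable way to picture what the range loop does on each step. This is a sketch under that assumption, not a description of the compiler's implementation; the helper name `decodeAll` is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeAll (hypothetical helper) decodes s one rune at a time, advancing by
// the width of each decoded rune, and returns the runes it found.
func decodeAll(s string) []rune {
	var out []rune
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		out = append(out, r)
		i += size // invalid bytes decode as U+FFFD with size 1
	}
	return out
}

func main() {
	for _, r := range decodeAll("日本語") {
		fmt.Printf("%U ", r)
	}
	fmt.Println() // U+65E5 U+672C U+8A9E

	// A byte that is not valid UTF-8 shows up as the replacement
	// character U+FFFD, both here and in a for range loop.
	for i, r := range "a\xbdb" {
		fmt.Printf("%d:%U ", i, r)
	}
	fmt.Println() // 0:U+0061 1:U+FFFD 2:U+0062
}
```

The second loop is worth noting when a string may hold arbitrary bytes: ranging over it never fails, it substitutes U+FFFD for each byte it cannot decode.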
[ Zach Coriarty ]