wc clone

wc (word count) is a simple Unix utility to get line, character, word, or byte counts from a file.

# Outputs the number of lines in filename.txt
wc -l filename.txt
# Outputs the number of words in filename.txt
wc -w filename.txt
# Outputs the number of bytes in filename.txt
wc -c filename.txt
# Outputs the number of characters in filename.txt
wc -m filename.txt
# Outputs the number of lines, words, and characters in filename.txt
wc -lwm filename.txt
# Outputs the number of lines by piping the contents of filename.txt to wc
cat filename.txt | wc -l
# Outputs line, word, and byte counts for multiple files and a total line at the end
wc filename.txt anotherfile.txt
# Outputs the number of lines for all text files in the current directory, summarized
wc -l *.txt

I’m going to use Go and Cobra for this project. (You may need to install cobra-cli before the last command below will run.)

mkdir wc && \
cd wc && \
go mod init wc && \
cobra-cli init

This gives us a great starting point.

tree .
.
├── LICENSE
├── cmd
│   └── root.go
├── go.mod
├── go.sum
└── main.go
2 directories, 5 files

We will be working mostly in cmd/root.go, as wc only has one root command. Unlike git, which has subcommands such as git commit, wc has none.

Flags

We’ll use the init function in cmd/root.go to specify our flags. We’re trying to mimic man wc.

func init() {
	rootCmd.Flags().BoolP("bytes", "c", false, "The number of bytes in each input file is written to the standard output. This will cancel out any prior usage of the -m option.")
	rootCmd.Flags().BoolP("lines", "l", false, "The number of lines in each input file is written to the standard output.")
	rootCmd.Flags().BoolP("words", "w", false, "The number of words in each input file is written to the standard output.")
	rootCmd.Flags().BoolP("chars", "m", false, "The number of characters in each input file is written to the standard output. If the current locale does not support multibyte characters, this is equivalent to the -c option. This will cancel out any prior usage of the -c option.")
}

Now we can run it with go run main.go -l, etc.

Tests

Before we go any further, I’d like to set up some tests. They’ll fail at first, of course.

test.sh
#!/bin/bash
# Build the ccwc binary
go build -o ccwc
# Compare counts and exit with code 1 if any difference is found
echo "Running: diff <(wc test.txt) <(./ccwc test.txt)" && diff <(wc test.txt) <(./ccwc test.txt) && \
echo "Running: diff <(wc -l test.txt) <(./ccwc -l test.txt)" && diff <(wc -l test.txt) <(./ccwc -l test.txt) && \
echo "Running: diff <(wc -w test.txt) <(./ccwc -w test.txt)" && diff <(wc -w test.txt) <(./ccwc -w test.txt) && \
echo "Running: diff <(wc -c test.txt) <(./ccwc -c test.txt)" && diff <(wc -c test.txt) <(./ccwc -c test.txt) && \
echo "Running: diff <(wc -m test.txt) <(./ccwc -m test.txt)" && diff <(wc -m test.txt) <(./ccwc -m test.txt) && \
echo "Running: diff <(wc -l -w test.txt) <(./ccwc -l -w test.txt)" && diff <(wc -l -w test.txt) <(./ccwc -l -w test.txt) && \
echo "Running: diff <(wc -l -c test.txt) <(./ccwc -l -c test.txt)" && diff <(wc -l -c test.txt) <(./ccwc -l -c test.txt) && \
echo "Running: diff <(wc -w -c test.txt) <(./ccwc -w -c test.txt)" && diff <(wc -w -c test.txt) <(./ccwc -w -c test.txt) && \
# This is the case I'm choosing to differ from wc on.
# echo "Running: diff <(wc -cm test.txt) <(./ccwc -cm test.txt)" && diff <(wc -cm test.txt) <(./ccwc -cm test.txt) && \
echo "Running: diff <(wc -mc test.txt) <(./ccwc -mc test.txt)" && diff <(wc -mc test.txt) <(./ccwc -mc test.txt) && \
echo "Running: diff <(wc -l -w -c test.txt) <(./ccwc -l -w -c test.txt)" && diff <(wc -l -w -c test.txt) <(./ccwc -l -w -c test.txt) || \
exit 1
echo "All tests passed!"
# Clean up the ccwc binary
rm ccwc

You can run these with

chmod +x ./test.sh
./test.sh

Parsing The File

Let’s write a function to parse the file and calculate all of the necessary values. I’m going to specify a struct as a return type for this function, so we have a clean interface to work with internally.

type FileParseResult struct {
	filename string
	lines    int
	words    int
	chars    int
	bytes    int
}

Now, let’s write our function to parse the file and calculate each of these values.

A few notes:

  • We only want to parse the file once.
  • We can’t use bufio.Scanner, because its default line splitting strips the line terminator, including the carriage return in Windows newlines (\r\n instead of \n), which gives an incorrect char count. Personally this doesn’t matter much to me, but I want our implementation to be as consistent with wc as possible.
  • We’re going to take in an io.Reader instead of a filename, so we can also use this function with standard input.

func getCounts(rd io.Reader, name string) (FileParseResult, error) {
	// @note: cannot use scanner because newline characters
	// are stripped, and \n vs. \r\n affects the char count
	reader := bufio.NewReader(rd)
	lines := 0
	words := 0
	chars := 0
	bytes := 0

	for {
		line, err := reader.ReadString('\n')
		if err != nil && err != io.EOF {
			return FileParseResult{}, err
		}

		// @note: without this check we would process an extra, empty
		// "line" when the file ends with a newline
		if err == io.EOF && len(line) == 0 {
			break
		}

		// wc -l counts newline characters, so a final line without
		// a trailing newline does not add to the line count
		if strings.HasSuffix(line, "\n") {
			lines++
		}
		words += len(strings.Fields(line))
		chars += utf8.RuneCountInString(line)
		bytes += len(line)

		if err == io.EOF {
			break
		}
	}

	return FileParseResult{
		lines:    lines,
		words:    words,
		chars:    chars,
		bytes:    bytes,
		filename: name,
	}, nil
}
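To make the scanner note concrete, here is a minimal sketch comparing the two reading strategies on a string with Windows line endings. The helper names bytesViaScanner and bytesViaReadString are made up for this illustration and are not part of the project.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// bytesViaScanner sums the lengths of the lines returned by bufio.Scanner.
// The default split function (bufio.ScanLines) drops the trailing "\n"
// and an optional "\r" before it, so those bytes are never seen.
func bytesViaScanner(input string) int {
	total := 0
	scanner := bufio.NewScanner(strings.NewReader(input))
	for scanner.Scan() {
		total += len(scanner.Text())
	}
	return total
}

// bytesViaReadString sums the lengths of the chunks returned by
// ReadString('\n'), which keeps the delimiter in each chunk.
func bytesViaReadString(input string) int {
	total := 0
	reader := bufio.NewReader(strings.NewReader(input))
	for {
		line, err := reader.ReadString('\n')
		total += len(line)
		if err != nil {
			// io.EOF (or any other error) ends the loop in this sketch.
			break
		}
	}
	return total
}

func main() {
	input := "one\r\ntwo\r\n" // 10 bytes with Windows line endings
	fmt.Println(bytesViaScanner(input))    // 6
	fmt.Println(bytesViaReadString(input)) // 10
}
```

The scanner version loses four bytes here (two \r and two \n), which is exactly the kind of difference that would show up when diffing against wc -c or wc -m.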

This can work with standard input like this:

reader := bufio.NewReader(os.Stdin)
fileParseResult, err := getCounts(reader, "")

Note that we’re also going to need a way to match wc’s output format. This took some playing around with, but here is what I came up with:

func (f FileParseResult) Println(bytesFlag bool, linesFlag bool, wordsFlag bool, charsEnabled bool, allFlagsDisabled bool) {
	s := ""

	if linesFlag || allFlagsDisabled {
		s += fmt.Sprintf("%8d", f.lines)
	}
	if wordsFlag || allFlagsDisabled {
		s += fmt.Sprintf("%8d", f.words)
	}
	if charsEnabled {
		s += fmt.Sprintf("%8d", f.chars)
	}
	if bytesFlag || allFlagsDisabled {
		s += fmt.Sprintf("%8d", f.bytes)
	}

	// wc prints no name (and no trailing space) when reading standard input
	if f.filename != "" {
		s += " " + f.filename
	}

	// Println rather than Printf, so a "%" in a filename is not treated as a format verb
	fmt.Println(s)
}
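If the %8d verb is unfamiliar: it right-aligns the number in a column at least eight characters wide. Here is a small sketch of the resulting layout, using a hypothetical formatRow helper and made-up counts.

```go
package main

import "fmt"

// formatRow is a hypothetical helper that right-aligns each count in an
// 8-character column and appends the name, mirroring the method above.
func formatRow(name string, counts ...int) string {
	s := ""
	for _, c := range counts {
		s += fmt.Sprintf("%8d", c)
	}
	return s + " " + name
}

func main() {
	// %q makes the leading spaces visible in the output.
	fmt.Printf("%q\n", formatRow("test.txt", 7, 58, 342))
	// "       7      58     342 test.txt"
}
```

Note that 8 is a minimum width, so a count with eight or more digits would touch the column before it.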

Putting It All Together

root.go
package cmd

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strings"
	"unicode/utf8"

	"github.com/spf13/cobra"
)

func check(e error) {
	if e != nil {
		panic(e)
	}
}

func getCounts(rd io.Reader, name string) (FileParseResult, error) {
	// @note: cannot use scanner because newline characters
	// are stripped, and \n vs. \r\n affects the char count
	reader := bufio.NewReader(rd)
	lines := 0
	words := 0
	chars := 0
	bytes := 0

	for {
		line, err := reader.ReadString('\n')
		if err != nil && err != io.EOF {
			return FileParseResult{}, err
		}

		// @note: without this check we would process an extra, empty
		// "line" when the file ends with a newline
		if err == io.EOF && len(line) == 0 {
			break
		}

		// wc -l counts newline characters, so a final line without
		// a trailing newline does not add to the line count
		if strings.HasSuffix(line, "\n") {
			lines++
		}
		words += len(strings.Fields(line))
		chars += utf8.RuneCountInString(line)
		bytes += len(line)

		if err == io.EOF {
			break
		}
	}

	return FileParseResult{
		lines:    lines,
		words:    words,
		chars:    chars,
		bytes:    bytes,
		filename: name,
	}, nil
}

type FileParseResult struct {
	filename string
	lines    int
	words    int
	chars    int
	bytes    int
}

func (f FileParseResult) Println(bytesFlag bool, linesFlag bool, wordsFlag bool, charsEnabled bool, allFlagsDisabled bool) {
	s := ""

	if linesFlag || allFlagsDisabled {
		s += fmt.Sprintf("%8d", f.lines)
	}
	if wordsFlag || allFlagsDisabled {
		s += fmt.Sprintf("%8d", f.words)
	}
	if charsEnabled {
		s += fmt.Sprintf("%8d", f.chars)
	}
	if bytesFlag || allFlagsDisabled {
		s += fmt.Sprintf("%8d", f.bytes)
	}

	// wc prints no name (and no trailing space) when reading standard input
	if f.filename != "" {
		s += " " + f.filename
	}

	// Println rather than Printf, so a "%" in a filename is not treated as a format verb
	fmt.Println(s)
}

var rootCmd = &cobra.Command{
	Use:   "wc",
	Short: "word, line, character, and byte count",
	Long:  `A clone of the wc command in Unix. Do "man wc" for more information.`,
	RunE: func(cmd *cobra.Command, files []string) error {
		bytesFlag, _ := cmd.Flags().GetBool("bytes")
		linesFlag, _ := cmd.Flags().GetBool("lines")
		wordsFlag, _ := cmd.Flags().GetBool("words")
		charsFlag, _ := cmd.Flags().GetBool("chars")

		// @note: I'm varying from official wc behavior here.
		// They will take the last of -c and -m if both are used.
		// I'm simply going to use -c if both are used.
		// Cobra does not have a simple way of getting the order.
		// Also, I really dislike -cm and -mc giving different behavior.
		charsEnabled := charsFlag && !bytesFlag
		allFlagsDisabled := !bytesFlag && !linesFlag && !wordsFlag && !charsFlag

		totalLines := 0
		totalWords := 0
		totalChars := 0
		totalBytes := 0

		if len(files) == 0 {
			reader := bufio.NewReader(os.Stdin)
			fileParseResult, err := getCounts(reader, "")
			check(err)

			// Print the counts for standard input (there is no filename to show)
			fileParseResult.Println(bytesFlag, linesFlag, wordsFlag, charsEnabled, allFlagsDisabled)

			totalLines += fileParseResult.lines
			totalWords += fileParseResult.words
			totalChars += fileParseResult.chars
			totalBytes += fileParseResult.bytes
		}

		for _, file := range files {
			fileReader, err := os.Open(file)
			check(err)

			fileParseResult, err := getCounts(fileReader, file)
			// Close right away so we don't hold one open handle per file
			fileReader.Close()
			check(err)

			fileParseResult.Println(bytesFlag, linesFlag, wordsFlag, charsEnabled, allFlagsDisabled)

			totalLines += fileParseResult.lines
			totalWords += fileParseResult.words
			totalChars += fileParseResult.chars
			totalBytes += fileParseResult.bytes
		}

		if len(files) > 1 {
			totalResult := FileParseResult{
				lines:    totalLines,
				words:    totalWords,
				chars:    totalChars,
				bytes:    totalBytes,
				filename: "total",
			}

			totalResult.Println(bytesFlag, linesFlag, wordsFlag, charsEnabled, allFlagsDisabled)
		}

		return nil
	},
}

func Execute() {
	err := rootCmd.Execute()
	if err != nil {
		os.Exit(1)
	}
}

func init() {
	rootCmd.Flags().BoolP("bytes", "c", false, "The number of bytes in each input file is written to the standard output. This will cancel out any prior usage of the -m option.")

	rootCmd.Flags().BoolP("lines", "l", false, "The number of lines in each input file is written to the standard output.")

	rootCmd.Flags().BoolP("words", "w", false, "The number of words in each input file is written to the standard output.")

	rootCmd.Flags().BoolP("chars", "m", false, "The number of characters in each input file is written to the standard output. If the current locale does not support multibyte characters, this is equivalent to the -c option. This will cancel out any prior usage of the -c option.")
}
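The flag handling is the easiest part to get subtly wrong, so here is a sketch that pulls it out into a pure function. The names columns and resolveColumns are hypothetical and don't exist in the project; the boolean expressions are the same ones used in RunE, where -c takes precedence when both -c and -m are given.

```go
package main

import "fmt"

// columns describes which counts get printed.
type columns struct {
	lines, words, chars, bytes bool
}

// resolveColumns mirrors the flag logic in RunE: no flags means the default
// lines/words/bytes trio, and -c wins over -m when both are given.
func resolveColumns(bytesFlag, linesFlag, wordsFlag, charsFlag bool) columns {
	none := !bytesFlag && !linesFlag && !wordsFlag && !charsFlag
	return columns{
		lines: linesFlag || none,
		words: wordsFlag || none,
		chars: charsFlag && !bytesFlag,
		bytes: bytesFlag || none,
	}
}

func main() {
	fmt.Printf("%+v\n", resolveColumns(false, false, false, false)) // no flags
	fmt.Printf("%+v\n", resolveColumns(true, false, false, true))   // -c -m
	fmt.Printf("%+v\n", resolveColumns(false, true, false, true))   // -l -m
}
```

A function like this can be covered by a table-driven Go test, which complements the shell script that diffs against the real wc.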

There is, however, a problem. What if I do this?

# I have a large file, test.txt
mkdir temp && \
for i in {1..100000}; do cp test.txt temp/test$i.txt; done
# then...
./ccwc temp/*.txt

Oh no! The files are processed one at a time, so this takes a while.

Note: you may be surprised that our program supports wildcards out of the box. Actually, the shell expands the pattern for us and passes the resulting list of filenames to our program as arguments.
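You can see the shell expansion for yourself with a tiny program that just echoes its arguments. This is a throwaway sketch, and describeArgs is a name I made up for it.

```go
package main

import (
	"fmt"
	"os"
)

// describeArgs reports how many arguments were received and lists them.
// args is expected to be os.Args[1:], i.e. without the program name.
func describeArgs(args []string) string {
	out := fmt.Sprintf("%d argument(s)\n", len(args))
	for i, a := range args {
		out += fmt.Sprintf("%d: %s\n", i, a)
	}
	return out
}

func main() {
	fmt.Print(describeArgs(os.Args[1:]))
}
```

Run it as go run main.go *.txt in a directory with a few text files and each file shows up as its own argument. Quote the pattern ('*.txt') and the program receives the literal string instead, because the shell no longer expands it.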

Handling Many Many Files

Now we need to add concurrency. We could run every call to getCounts in its own goroutine and start them all immediately, but then we’d have n goroutines and n open file handles at once (where n is the number of files).

I’m going to add a semaphore, just like I show here.
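In case the idea is new: a counting semaphore in Go can be as small as a buffered channel. The sketch below is an assumed implementation with the same NewSemaphore, Acquire, and Release names used in this post; it is not necessarily the code in wc/utils. The maxConcurrent helper is hypothetical and only exists to show that the limit holds.

```go
package main

import (
	"fmt"
	"sync"
)

// Semaphore is a counting semaphore backed by a buffered channel.
// This is an assumed implementation, not the code from wc/utils.
type Semaphore struct {
	slots chan struct{}
}

// NewSemaphore returns a semaphore that allows up to n concurrent holders.
func NewSemaphore(n int) *Semaphore {
	return &Semaphore{slots: make(chan struct{}, n)}
}

// Acquire blocks until a slot is free, then takes it.
func (s *Semaphore) Acquire() { s.slots <- struct{}{} }

// Release frees a slot taken by Acquire.
func (s *Semaphore) Release() { <-s.slots }

// maxConcurrent runs `tasks` goroutines gated by a semaphore of size `limit`
// and returns the highest number that were running at the same time.
func maxConcurrent(tasks, limit int) int {
	sem := NewSemaphore(limit)
	var wg sync.WaitGroup
	var mu sync.Mutex
	running, peak := 0, 0

	for i := 0; i < tasks; i++ {
		sem.Acquire() // blocks the loop once `limit` goroutines are in flight
		wg.Add(1)
		go func() {
			defer sem.Release()
			defer wg.Done()

			mu.Lock()
			running++
			if running > peak {
				peak = running
			}
			mu.Unlock()

			// Real work (opening and counting a file) would happen here.

			mu.Lock()
			running--
			mu.Unlock()
		}()
	}

	wg.Wait()
	return peak
}

func main() {
	fmt.Println(maxConcurrent(1000, 50) <= 50) // true
}
```

Because Acquire is called in the loop before each goroutine starts, the loop itself stalls once the limit is reached, so no more than the limit of goroutines (and file handles) ever exist at once.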

root.go
package cmd

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strings"
	"sync"
	"unicode/utf8"

	"wc/utils"

	"github.com/spf13/cobra"
)

// ...

var rootCmd = &cobra.Command{
	Use:   "wc",
	Short: "word, line, character, and byte count",
	Long:  `A clone of the wc command in Unix. Do "man wc" for more information.`,
	RunE: func(cmd *cobra.Command, files []string) error {
		// ...

		semaphore := utils.NewSemaphore(50)
		wg := sync.WaitGroup{}
		totals := make(chan FileParseResult)
		// Closed by the totals goroutine once it has drained the channel
		done := make(chan struct{})

		totalLines := 0
		totalWords := 0
		totalChars := 0
		totalBytes := 0

		go func() {
			defer close(done)
			for fileParseResult := range totals {
				totalLines += fileParseResult.lines
				totalWords += fileParseResult.words
				totalChars += fileParseResult.chars
				totalBytes += fileParseResult.bytes
			}
		}()

		if len(files) == 0 {
			semaphore.Acquire()
			wg.Add(1)
			go func() {
				defer semaphore.Release()
				defer wg.Done()

				reader := bufio.NewReader(os.Stdin)
				fileParseResult, err := getCounts(reader, "")
				check(err)

				// Print the counts for standard input
				fileParseResult.Println(bytesFlag, linesFlag, wordsFlag, charsEnabled, allFlagsDisabled)

				totals <- fileParseResult
			}()
		}

		for _, file := range files {
			semaphore.Acquire()
			wg.Add(1)
			go func(file string) {
				defer semaphore.Release()
				defer wg.Done()

				fileReader, err := os.Open(file)
				check(err)
				// Release the file handle when this goroutine finishes
				defer fileReader.Close()

				fileParseResult, err := getCounts(fileReader, file)
				check(err)

				fileParseResult.Println(bytesFlag, linesFlag, wordsFlag, charsEnabled, allFlagsDisabled)

				totals <- fileParseResult
			}(file)
		}

		wg.Wait()
		close(totals)
		// Wait for the last result to be added before reading the totals
		<-done

		if len(files) > 1 {
			totalResult := FileParseResult{
				lines:    totalLines,
				words:    totalWords,
				chars:    totalChars,
				bytes:    totalBytes,
				filename: "total",
			}

			totalResult.Println(bytesFlag, linesFlag, wordsFlag, charsEnabled, allFlagsDisabled)
		}

		return nil
	},
}

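Now we can run this on as many files as we want, and it will be fast while keeping a reasonable memory profile. One trade-off: the per-file lines are printed in the order the goroutines finish, which is not necessarily the order of the arguments.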

Here is the GitHub repo with all of the code.
