wc clone

May 2, 2024

wc (word count) is a simple Unix utility to get line, character, word, or byte counts from a file.

# Outputs the number of lines in filename.txt
wc -l filename.txt

# Outputs the number of words in filename.txt
wc -w filename.txt

# Outputs the number of bytes in filename.txt
wc -c filename.txt

# Outputs the number of characters in filename.txt
wc -m filename.txt

# Outputs the number of lines, words, and characters in filename.txt
wc -lwm filename.txt

# Outputs the number of lines by piping the contents of filename.txt to wc
cat filename.txt | wc -l

# Outputs line, word, and byte counts for multiple files and a total line at the end
wc filename.txt anotherfile.txt

# Outputs the number of lines for all text files in the current directory, summarized
wc -l *.txt

I’m going to use Go and Cobra for this project. (You may need to install cobra-cli to run the command below.)

mkdir wc && \
  cd wc && \
  go mod init wc && \
  cobra-cli init

This gives us a great starting point.

tree .
.
├── LICENSE
├── cmd
│   └── root.go
├── go.mod
├── go.sum
└── main.go

2 directories, 5 files

We will be working mostly in cmd/root.go, as wc has only one root command. It doesn’t have subcommands the way git has git commit.

Flags

We’ll use the init function in cmd/root.go to specify our flags. We’re trying to mimic man wc.

func init() {
  rootCmd.Flags().BoolP("bytes", "c", false, "The number of bytes in each input file is written to the standard output.  This will cancel out any prior usage of the -m option.")

  rootCmd.Flags().BoolP("lines", "l", false, "The number of lines in each input file is written to the standard output.")

  rootCmd.Flags().BoolP("words", "w", false, "The number of words in each input file is written to the standard output.")

  rootCmd.Flags().BoolP("chars", "m", false, "The number of characters in each input file is written to the standard output.  If the current locale does not support multibyte characters, this is equivalent to the -c option.  This will cancel out any prior usage of the -c option.")
}

Now we can run it with go run main.go -l, etc.

Tests

Before we go any further, I’d like to set up some tests. They’ll fail at first, of course.

#!/bin/bash

# Build the ccwc binary
go build -o ccwc

# Compare counts and exit with code 1 if any difference is found
echo "Running: diff <(wc test.txt) <(./ccwc test.txt)" && diff <(wc test.txt) <(./ccwc test.txt) && \
echo "Running: diff <(wc -l test.txt) <(./ccwc -l test.txt)" && diff <(wc -l test.txt) <(./ccwc -l test.txt) && \
echo "Running: diff <(wc -w test.txt) <(./ccwc -w test.txt)" && diff <(wc -w test.txt) <(./ccwc -w test.txt) && \
echo "Running: diff <(wc -c test.txt) <(./ccwc -c test.txt)" && diff <(wc -c test.txt) <(./ccwc -c test.txt) && \
echo "Running: diff <(wc -m test.txt) <(./ccwc -m test.txt)" && diff <(wc -m test.txt) <(./ccwc -m test.txt) && \
echo "Running: diff <(wc -l -w test.txt) <(./ccwc -l -w test.txt)" && diff <(wc -l -w test.txt) <(./ccwc -l -w test.txt) && \
echo "Running: diff <(wc -l -c test.txt) <(./ccwc -l -c test.txt)" && diff <(wc -l -c test.txt) <(./ccwc -l -c test.txt) && \
echo "Running: diff <(wc -w -c test.txt) <(./ccwc -w -c test.txt)" && diff <(wc -w -c test.txt) <(./ccwc -w -c test.txt) && \
# This is the case I'm choosing to differ from wc on.
# echo "Running: diff <(wc -cm test.txt) <(./ccwc -cm test.txt)" && diff <(wc -cm test.txt) <(./ccwc -cm test.txt) && \
echo "Running: diff <(wc -mc test.txt) <(./ccwc -mc test.txt)" && diff <(wc -mc test.txt) <(./ccwc -mc test.txt) && \
echo "Running: diff <(wc -l -w -c test.txt) <(./ccwc -l -w -c test.txt)" && diff <(wc -l -w -c test.txt) <(./ccwc -l -w -c test.txt) || \
exit 1

echo "All tests passed!"

# Clean up the ccwc binary
rm ccwc

You can run these with

chmod +x ./test.sh
./test.sh

Parsing The File

Let’s write a function to parse the file and calculate all of the necessary values. I’m going to specify a struct as a return type for this function, so we have a clean interface to work with internally.

type FileParseResult struct {
  filename string
  lines    int
  words    int
  chars    int
  bytes    int
}

Now, let’s write our function to parse the file and calculate each of these values.

A few notes:

We only want to parse the file once.
We can’t use scanner.Scan(), because it strips line endings, including the carriage return in Windows newlines (\r\n instead of \n), which gives an incorrect char count. Personally this doesn’t matter much to me, but I want our implementation to be as consistent with wc as possible.
We’re going to take in an io.Reader instead of a filename, so we can also use this function with standard input.
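
To make the second note concrete, here is a small sketch (not part of the project; the helper names scannerBytes and readStringBytes are made up for illustration) that feeds the same input with Windows line endings through both approaches and adds up the bytes each one hands back.

```go
package main

import (
  "bufio"
  "fmt"
  "strings"
)

// scannerBytes sums the length of every line returned by bufio.Scanner.
// The default split function (bufio.ScanLines) drops the trailing "\n"
// and an optional "\r" before it, so those bytes are never seen.
func scannerBytes(input string) int {
  total := 0
  scanner := bufio.NewScanner(strings.NewReader(input))
  for scanner.Scan() {
    total += len(scanner.Text())
  }
  return total
}

// readStringBytes sums the length of every chunk returned by
// bufio.Reader.ReadString, which keeps the delimiter in the result.
func readStringBytes(input string) int {
  total := 0
  reader := bufio.NewReader(strings.NewReader(input))
  for {
    line, err := reader.ReadString('\n')
    total += len(line)
    if err != nil { // end of input (or a real error) ends the loop
      break
    }
  }
  return total
}

func main() {
  input := "one\r\ntwo\r\n" // 10 bytes in total
  fmt.Println("scanner:   ", scannerBytes(input))
  fmt.Println("ReadString:", readStringBytes(input))
}
```

For this input the scanner version only ever sees 6 bytes, while the ReadString version sees all 10, which is why the parsing function reads with ReadString.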

func getCounts(rd io.Reader, name string) (FileParseResult, error) {
  // @note: cannot use scanner because newline characters
  //        are stripped, and \n vs. \r\n affects the char count
  reader := bufio.NewReader(rd)
  lines := 0
  words := 0
  chars := 0
  bytes := 0

  for {
    line, err := reader.ReadString('\n')
    if err != nil && err != io.EOF {
      return FileParseResult{}, err
    }

    // @note: without this check we would count an extra (empty) line
    //        whenever the input ends with a newline
    if err == io.EOF && len(line) == 0 {
      break
    }

    // @note: a final line with no trailing newline is counted too,
    //        whereas wc only counts newline characters
    lines++
    words += len(strings.Fields(line))
    chars += utf8.RuneCountInString(line)
    bytes += len(line)

    if err == io.EOF {
      break
    }
  }

  return FileParseResult{
    lines:    lines,
    words:    words,
    chars:    chars,
    bytes:    bytes,
    filename: name,
  }, nil
}
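
To see why chars and bytes are tracked separately, here is a minimal sketch of the three per-line measurements on a single line containing multibyte characters. The helper name lineStats is made up for this example and is not part of the project.

```go
package main

import (
  "fmt"
  "strings"
  "unicode/utf8"
)

// lineStats mirrors the three per-line measurements made in getCounts.
func lineStats(line string) (words, chars, bytes int) {
  words = len(strings.Fields(line))    // splits on runs of whitespace
  chars = utf8.RuneCountInString(line) // counts Unicode code points
  bytes = len(line)                    // counts raw bytes
  return words, chars, bytes
}

func main() {
  // "héllo wörld" followed by a Windows line ending; é and ö are written
  // as escapes so the example does not depend on the file's encoding.
  words, chars, bytes := lineStats("h\u00e9llo w\u00f6rld\r\n")
  fmt.Println(words, chars, bytes) // prints "2 13 15"
}
```

The two accented letters take two bytes each in UTF-8, so the byte count is two higher than the character count, and the \r\n ending contributes to both.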

This can work with standard input like this:

reader := bufio.NewReader(os.Stdin)
fileParseResult, err := getCounts(reader, "")

Note that we’re also going to need a way to match wc’s output format. This took some playing around with, but here is what I came up with:

func (f FileParseResult) Println(bytesFlag bool, linesFlag bool, wordsFlag bool, charsEnabled bool, allFlagsDisabled bool) {
  s := ""

  if linesFlag || allFlagsDisabled {
    s += fmt.Sprintf("%8d", f.lines)
  }
  if wordsFlag || allFlagsDisabled {
    s += fmt.Sprintf("%8d", f.words)
  }
  if charsEnabled {
    s += fmt.Sprintf("%8d", f.chars)
  }
  if bytesFlag || allFlagsDisabled {
    s += fmt.Sprintf("%8d", f.bytes)
  }

  // Standard input has no filename, so nothing is appended for it.
  if f.filename != "" {
    s += " " + f.filename
  }

  // Println rather than Printf: a filename containing % must not be
  // interpreted as a format string.
  fmt.Println(s)
}

Putting It All Together

package cmd

import (
  "bufio"
  "fmt"
  "io"
  "os"
  "strings"
  "unicode/utf8"

  "github.com/spf13/cobra"
)

func check(e error) {
  if e != nil {
    panic(e)
  }
}

func getCounts(rd io.Reader, name string) (FileParseResult, error) {
  // @note: cannot use scanner because newline characters
  //        are stripped, and \n vs. \r\n affects the char count
  reader := bufio.NewReader(rd)
  lines := 0
  words := 0
  chars := 0
  bytes := 0

  for {
    line, err := reader.ReadString('\n')
    if err != nil && err != io.EOF {
      return FileParseResult{}, err
    }

    // @note: without this check we would count an extra (empty) line
    //        whenever the input ends with a newline
    if err == io.EOF && len(line) == 0 {
      break
    }

    // @note: a final line with no trailing newline is counted too,
    //        whereas wc only counts newline characters
    lines++
    words += len(strings.Fields(line))
    chars += utf8.RuneCountInString(line)
    bytes += len(line)

    if err == io.EOF {
      break
    }
  }

  return FileParseResult{
    lines:    lines,
    words:    words,
    chars:    chars,
    bytes:    bytes,
    filename: name,
  }, nil
}

type FileParseResult struct {
  filename string
  lines    int
  words    int
  chars    int
  bytes    int
}

func (f FileParseResult) Println(bytesFlag bool, linesFlag bool, wordsFlag bool, charsEnabled bool, allFlagsDisabled bool) {
  s := ""

  if linesFlag || allFlagsDisabled {
    s += fmt.Sprintf("%8d", f.lines)
  }
  if wordsFlag || allFlagsDisabled {
    s += fmt.Sprintf("%8d", f.words)
  }
  if charsEnabled {
    s += fmt.Sprintf("%8d", f.chars)
  }
  if bytesFlag || allFlagsDisabled {
    s += fmt.Sprintf("%8d", f.bytes)
  }

  // Standard input has no filename, so nothing is appended for it.
  if f.filename != "" {
    s += " " + f.filename
  }

  // Println rather than Printf: a filename containing % must not be
  // interpreted as a format string.
  fmt.Println(s)
}

var rootCmd = &cobra.Command{
  Use:   "wc",
  Short: "word, line, character, and byte count",
  Long:  `A clone of the wc command in Unix. Do "man wc" for more information.`,
  RunE: func(cmd *cobra.Command, files []string) error {
    bytesFlag, _ := cmd.Flags().GetBool("bytes")
    linesFlag, _ := cmd.Flags().GetBool("lines")
    wordsFlag, _ := cmd.Flags().GetBool("words")
    charsFlag, _ := cmd.Flags().GetBool("chars")

    // @note: I'm varying from official wc behavior here.
    //        They will take the last of -c and -m if both are used.
    //        I'm simply going to use -c if both are used.
    //        Cobra does not have a simple way of getting the order.
    //        Also, I really dislike -cm and -mc giving different behavior.
    charsEnabled := charsFlag && !bytesFlag
    allFlagsDisabled := !bytesFlag && !linesFlag && !wordsFlag && !charsFlag

    totalLines := 0
    totalWords := 0
    totalChars := 0
    totalBytes := 0

    // No file arguments: read from standard input, as wc does.
    if len(files) == 0 {
      reader := bufio.NewReader(os.Stdin)
      fileParseResult, err := getCounts(reader, "")
      check(err)

      fileParseResult.Println(bytesFlag, linesFlag, wordsFlag, charsEnabled, allFlagsDisabled)

      totalLines += fileParseResult.lines
      totalWords += fileParseResult.words
      totalChars += fileParseResult.chars
      totalBytes += fileParseResult.bytes
    }

    for _, file := range files {
      fileReader, err := os.Open(file)
      check(err)

      fileParseResult, err := getCounts(fileReader, file)
      // Close right away; a defer here would keep every file open
      // until RunE returns.
      fileReader.Close()
      check(err)

      fileParseResult.Println(bytesFlag, linesFlag, wordsFlag, charsEnabled, allFlagsDisabled)

      totalLines += fileParseResult.lines
      totalWords += fileParseResult.words
      totalChars += fileParseResult.chars
      totalBytes += fileParseResult.bytes
    }

    // wc only prints a total line when more than one file is given.
    if len(files) > 1 {
      totalResult := FileParseResult{
        lines:    totalLines,
        words:    totalWords,
        chars:    totalChars,
        bytes:    totalBytes,
        filename: "total",
      }

      totalResult.Println(bytesFlag, linesFlag, wordsFlag, charsEnabled, allFlagsDisabled)
    }

    return nil
  },
}

func Execute() {
  err := rootCmd.Execute()
  if err != nil {
    os.Exit(1)
  }
}

func init() {
  rootCmd.Flags().BoolP("bytes", "c", false, "The number of bytes in each input file is written to the standard output.  This will cancel out any prior usage of the -m option.")

  rootCmd.Flags().BoolP("lines", "l", false, "The number of lines in each input file is written to the standard output.")

  rootCmd.Flags().BoolP("words", "w", false, "The number of words in each input file is written to the standard output.")

  rootCmd.Flags().BoolP("chars", "m", false, "The number of characters in each input file is written to the standard output.  If the current locale does not support multibyte characters, this is equivalent to the -c option.  This will cancel out any prior usage of the -c option.")
}

There is, however, a problem. What if I do this?

# I have a large file, test.txt
mkdir temp && \
  for i in {1..100000}; do cp test.txt temp/test$i.txt; done

# then...
./ccwc temp/*.txt

Oh no!

Note - you may be surprised our program supports wildcards out of the box. Actually, the shell expands the pattern for us and passes the matching filenames to our program as separate arguments.
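
A quick way to check this for yourself is a throwaway program that just prints what it received. This is a sketch, separate from the wc clone; describeArgs is a name made up for this example.

```go
package main

import (
  "fmt"
  "os"
  "strings"
)

// describeArgs renders the arguments a program received, one per line.
func describeArgs(args []string) string {
  var sb strings.Builder
  fmt.Fprintf(&sb, "%d argument(s)\n", len(args))
  for i, arg := range args {
    fmt.Fprintf(&sb, "[%d] %s\n", i, arg)
  }
  return sb.String()
}

func main() {
  // os.Args[0] is the program name; the rest is what the shell passed in.
  fmt.Print(describeArgs(os.Args[1:]))
}
```

Saved as args.go, running go run args.go *.txt lists one argument per matching file, while quoting the pattern (go run args.go '*.txt') shows a single literal *.txt argument, because the shell no longer expands it.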

Handling Many Many Files

Now we need to add concurrency. We could run every call to getCounts in its own goroutine and start them all immediately, but then we’d have n goroutines and n open file handles at once (where n is the number of files).

I’m going to add a semaphore, just like I show here.
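
The utils package itself isn't shown in this post. As a rough sketch of the idea, and assuming the semaphore is a thin wrapper around a buffered channel, it could look something like the following. The method names match the NewSemaphore, Acquire, and Release calls used in the command code; the implementation and the maxConcurrent helper are my assumptions for illustration.

```go
package main

import (
  "fmt"
  "sync"
)

// Semaphore is a counting semaphore backed by a buffered channel.
// Hypothetical stand-in for the type returned by utils.NewSemaphore.
type Semaphore struct {
  slots chan struct{}
}

// NewSemaphore allows at most limit holders at a time.
func NewSemaphore(limit int) *Semaphore {
  return &Semaphore{slots: make(chan struct{}, limit)}
}

// Acquire blocks while limit holders are already active.
func (s *Semaphore) Acquire() { s.slots <- struct{}{} }

// Release frees one slot.
func (s *Semaphore) Release() { <-s.slots }

// maxConcurrent runs jobs tasks under a semaphore with the given limit
// and reports the highest number of tasks that were active at once.
func maxConcurrent(jobs, limit int) int {
  sem := NewSemaphore(limit)
  var wg sync.WaitGroup
  var mu sync.Mutex
  active, peak := 0, 0

  for i := 0; i < jobs; i++ {
    sem.Acquire() // blocks the loop once limit goroutines are running
    wg.Add(1)
    go func() {
      defer sem.Release()
      defer wg.Done()

      mu.Lock()
      active++
      if active > peak {
        peak = active
      }
      mu.Unlock()

      // ... real work (open and parse one file) would go here ...

      mu.Lock()
      active--
      mu.Unlock()
    }()
  }

  wg.Wait()
  return peak
}

func main() {
  fmt.Println("peak concurrency:", maxConcurrent(1000, 50))
}
```

Because Acquire is called before the goroutine starts, the loop itself stalls once the limit is reached, so the peak never exceeds the limit no matter how many jobs are queued.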

package cmd

import (
  "bufio"
  "fmt"
  "io"
  "os"
  "strings"
  "sync"
  "unicode/utf8"

  "wc/utils"

  "github.com/spf13/cobra"
)

// ...

var rootCmd = &cobra.Command{
  Use:   "wc",
  Short: "word, line, character, and byte count",
  Long:  `A clone of the wc command in Unix. Do "man wc" for more information.`,
  RunE: func(cmd *cobra.Command, files []string) error {
    // ...

    // Allow at most 50 files to be open and parsed at once.
    semaphore := utils.NewSemaphore(50)
    wg := sync.WaitGroup{}
    totals := make(chan FileParseResult)
    // Closed by the aggregator once it has drained totals.
    done := make(chan struct{})

    totalLines := 0
    totalWords := 0
    totalChars := 0
    totalBytes := 0

    // A single goroutine owns the running totals, so no mutex is needed.
    go func() {
      defer close(done)
      for fileParseResult := range totals {
        totalLines += fileParseResult.lines
        totalWords += fileParseResult.words
        totalChars += fileParseResult.chars
        totalBytes += fileParseResult.bytes
      }
    }()

    if len(files) == 0 {
      semaphore.Acquire()
      wg.Add(1)
      go func() {
        defer semaphore.Release()
        defer wg.Done()

        reader := bufio.NewReader(os.Stdin)
        fileParseResult, err := getCounts(reader, "")
        check(err)

        fileParseResult.Println(bytesFlag, linesFlag, wordsFlag, charsEnabled, allFlagsDisabled)

        totals <- fileParseResult
      }()
    }

    for _, file := range files {
      semaphore.Acquire()
      wg.Add(1)
      go func(file string) {
        defer semaphore.Release()
        defer wg.Done()

        fileReader, err := os.Open(file)
        check(err)
        defer fileReader.Close()

        fileParseResult, err := getCounts(fileReader, file)
        check(err)

        fileParseResult.Println(bytesFlag, linesFlag, wordsFlag, charsEnabled, allFlagsDisabled)

        totals <- fileParseResult
      }(file)
    }

    wg.Wait()
    close(totals)
    // Wait for the aggregator to finish before reading the totals;
    // otherwise the last update could race with the reads below.
    <-done

    if len(files) > 1 {
      totalResult := FileParseResult{
        lines:    totalLines,
        words:    totalWords,
        chars:    totalChars,
        bytes:    totalBytes,
        filename: "total",
      }

      totalResult.Println(bytesFlag, linesFlag, wordsFlag, charsEnabled, allFlagsDisabled)
    }

    return nil
  },
}

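Now we can run this thing on as many files as we want. It will be fast, and because at most 50 files are open at once, memory and file-handle usage stay bounded. One trade-off: the per-file lines now print in the order the files finish, not the order they were passed in.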

Here is the GitHub repo with all of the code.