wc clone

wc (word count) is a simple Unix utility to get line, character, word, or byte counts from a file.

# Outputs the number of lines in filename.txt
wc -l filename.txt

# Outputs the number of words in filename.txt
wc -w filename.txt

# Outputs the number of bytes in filename.txt
wc -c filename.txt

# Outputs the number of characters in filename.txt
wc -m filename.txt

# Outputs the number of lines, words, and characters in filename.txt
wc -lwm filename.txt

# Outputs the number of lines by piping the contents of filename.txt to wc
cat filename.txt | wc -l

# Outputs line, word, and byte counts for multiple files and a total line at the end
wc filename.txt anotherfile.txt

# Outputs the number of lines for all text files in the current directory, summarized
wc -l *.txt

I'm going to use Go and Cobra for this project. (You may need to install cobra-cli to run the commands below.)

mkdir wc && \
	cd wc && \
	go mod init wc && \
	cobra-cli init

This gives us a great starting point.

tree .
.
├── LICENSE
├── cmd
│   └── root.go
├── go.mod
├── go.sum
└── main.go

2 directories, 5 files

We will be working mostly in cmd/root.go, as wc only has 1 root command. It doesn't have subcommands like git commit.

Flags

We'll use the init function in cmd/root.go to specify our flags. We're trying to mimic man wc.

func init() {
	rootCmd.Flags().BoolP("bytes", "c", false, "The number of bytes in each input file is written to the standard output.  This will cancel out any prior usage of the -m option.")

	rootCmd.Flags().BoolP("lines", "l", false, "The number of lines in each input file is written to the standard output.")

	rootCmd.Flags().BoolP("words", "w", false, "The number of words in each input file is written to the standard output.")

	rootCmd.Flags().BoolP("chars", "m", false, "The number of characters in each input file is written to the standard output.  If the current locale does not support multibyte characters, this is equivalent to the -c option.  This will cancel out any prior usage of the -c option.")
}

Now we can run it with go run main.go -l, etc.

Tests

Before we go any further, I'd like to set up some tests that compare our output against the real wc. They'll fail at first, of course.

#!/bin/bash

# Build the ccwc binary
go build -o ccwc

# Compare counts and exit with code 1 if any difference is found
echo "Running: diff <(wc test.txt) <(./ccwc test.txt)" && diff <(wc test.txt) <(./ccwc test.txt) && \
echo "Running: diff <(wc -l test.txt) <(./ccwc -l test.txt)" && diff <(wc -l test.txt) <(./ccwc -l test.txt) && \
echo "Running: diff <(wc -w test.txt) <(./ccwc -w test.txt)" && diff <(wc -w test.txt) <(./ccwc -w test.txt) && \
echo "Running: diff <(wc -c test.txt) <(./ccwc -c test.txt)" && diff <(wc -c test.txt) <(./ccwc -c test.txt) && \
echo "Running: diff <(wc -m test.txt) <(./ccwc -m test.txt)" && diff <(wc -m test.txt) <(./ccwc -m test.txt) && \
echo "Running: diff <(wc -l -w test.txt) <(./ccwc -l -w test.txt)" && diff <(wc -l -w test.txt) <(./ccwc -l -w test.txt) && \
echo "Running: diff <(wc -l -c test.txt) <(./ccwc -l -c test.txt)" && diff <(wc -l -c test.txt) <(./ccwc -l -c test.txt) && \
echo "Running: diff <(wc -w -c test.txt) <(./ccwc -w -c test.txt)" && diff <(wc -w -c test.txt) <(./ccwc -w -c test.txt) && \
# This is the case I'm choosing to differ from wc on.
# echo "Running: diff <(wc -cm test.txt) <(./ccwc -cm test.txt)" && diff <(wc -cm test.txt) <(./ccwc -cm test.txt) && \
echo "Running: diff <(wc -mc test.txt) <(./ccwc -mc test.txt)" && diff <(wc -mc test.txt) <(./ccwc -mc test.txt) && \
echo "Running: diff <(wc -l -w -c test.txt) <(./ccwc -l -w -c test.txt)" && diff <(wc -l -w -c test.txt) <(./ccwc -l -w -c test.txt) || \
exit 1

echo "All tests passed!"

# Clean up the ccwc binary
rm ccwc

You can run these with

chmod +x ./test.sh
./test.sh

Parsing The File

Let's write a function to parse the file and calculate all of the necessary values. I'm going to specify a struct as a return type for this function, so we have a clean interface to work with internally.

type FileParseResult struct {
	filename string
	lines    int
	words    int
	chars    int
	bytes    int
}

Now, let's write our function to parse the file and calculate each of these values.

A few notes:

  • We only want to parse the file once.
  • We can't use scanner.Scan(), because this will give an incorrect char count, by removing carriage returns when we have Windows newlines, e.g. \r\n instead of \n. Personally this doesn't matter much to me, but I want our implementation to be consistent with wc as much as possible.
  • We're going to take in an io.Reader instead of a filename, so we can also use this function with standard input.
func getCounts(rd io.Reader, name string) (FileParseResult, error) {
	// @note: cannot use scanner because newline characters
	//        are stripped, and \n vs. \r\n affects the char count
	reader := bufio.NewReader(rd)
	lines := 0
	words := 0
	chars := 0
	bytes := 0

	for {
		line, err := reader.ReadString('\n')
		if err != nil && err != io.EOF {
			return FileParseResult{}, err
		}

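		// @note: nothing left to read; without this check, a file that
		//        ends with a newline would count an extra line
		if err == io.EOF && len(line) == 0 {
			break
		}

		// wc counts newline characters, so a final line with no
		// trailing newline does not add to the line count
		if strings.HasSuffix(line, "\n") {
			lines++
		}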
		words += len(strings.Fields(line))
		chars += utf8.RuneCountInString(line)
		bytes += len(line)

		if err == io.EOF {
			break
		}
	}

	return FileParseResult{
		lines:    lines,
		words:    words,
		chars:    chars,
		bytes:    bytes,
		filename: name,
	}, nil
}

This can work with standard input like this:

reader := bufio.NewReader(os.Stdin)
fileParseResult, err := getCounts(reader, "")
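
The note about scanner.Scan() above can be checked with a small standalone program. This is only a sketch: scannerBytes and readStringBytes are names made up for the illustration and are not part of the project. It feeds the same Windows-style input to both approaches and compares how many bytes each one sees.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// scannerBytes sums the length of every line returned by bufio.Scanner.
// The default split function drops the trailing \n and any \r before it.
func scannerBytes(s string) int {
	total := 0
	sc := bufio.NewScanner(strings.NewReader(s))
	for sc.Scan() {
		total += len(sc.Text())
	}
	return total
}

// readStringBytes sums the length of every chunk returned by
// ReadString('\n'), which keeps the delimiter and everything before it.
func readStringBytes(s string) int {
	total := 0
	r := bufio.NewReader(strings.NewReader(s))
	for {
		line, err := r.ReadString('\n')
		total += len(line)
		if err != nil {
			// For an in-memory reader the only error is io.EOF.
			break
		}
	}
	return total
}

func main() {
	input := "one\r\ntwo\r\n"
	fmt.Println(len(input), scannerBytes(input), readStringBytes(input))
	// prints "10 6 10"
}
```

The scanner loses four bytes here (two \r\n pairs), which is exactly the difference that would throw off the -c and -m counts.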

Note that we're also going to need a way to match wc's output format. This took some playing around with, but here is what I came up with:

func (f FileParseResult) Println(bytesFlag bool, linesFlag bool, wordsFlag bool, charsEnabled bool, allFlagsDisabled bool) {
	s := ""

	if linesFlag || allFlagsDisabled {
		s += fmt.Sprintf("%8d", f.lines)
	}
	if wordsFlag || allFlagsDisabled {
		s += fmt.Sprintf("%8d", f.words)
	}
	if charsEnabled {
		s += fmt.Sprintf("%8d", f.chars)
	}
	if bytesFlag || allFlagsDisabled {
		s += fmt.Sprintf("%8d", f.bytes)
	}

	// Println rather than Printf, so a % in a filename is not
	// treated as a format verb
	fmt.Println(s + " " + f.filename)
}
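
To make the column layout easier to see, here is a small sketch of what the %8d verb produces. formatCounts is a hypothetical helper written for this illustration (it returns the string instead of printing it), and the counts are made-up numbers, not output from a real file.

```go
package main

import "fmt"

// formatCounts is a hypothetical helper: it right-aligns each count in an
// 8-character column, then appends a space and the name.
func formatCounts(name string, counts ...int) string {
	s := ""
	for _, c := range counts {
		s += fmt.Sprintf("%8d", c)
	}
	return s + " " + name
}

func main() {
	// %q wraps the result in quotes so the leading padding is visible.
	fmt.Printf("%q\n", formatCounts("test.txt", 7, 58, 342))
	// prints "       7      58     342 test.txt"
}
```

Each number is padded on the left to eight characters, so the columns line up as long as every count has fewer than eight digits.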

Putting It All Together

root.go
package cmd

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strings"
	"unicode/utf8"

	"github.com/spf13/cobra"
)

func check(e error) {
	if e != nil {
		panic(e)
	}
}

func getCounts(rd io.Reader, name string) (FileParseResult, error) {
	// @note: cannot use scanner because newline characters
	//        are stripped, and \n vs. \r\n affects the char count
	reader := bufio.NewReader(rd)
	lines := 0
	words := 0
	chars := 0
	bytes := 0

	for {
		line, err := reader.ReadString('\n')
		if err != nil && err != io.EOF {
			return FileParseResult{}, err
		}

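		// @note: nothing left to read; without this check, a file that
		//        ends with a newline would count an extra line
		if err == io.EOF && len(line) == 0 {
			break
		}

		// wc counts newline characters, so a final line with no
		// trailing newline does not add to the line count
		if strings.HasSuffix(line, "\n") {
			lines++
		}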
		words += len(strings.Fields(line))
		chars += utf8.RuneCountInString(line)
		bytes += len(line)

		if err == io.EOF {
			break
		}
	}

	return FileParseResult{
		lines:    lines,
		words:    words,
		chars:    chars,
		bytes:    bytes,
		filename: name,
	}, nil
}

type FileParseResult struct {
	filename string
	lines    int
	words    int
	chars    int
	bytes    int
}

func (f FileParseResult) Println(bytesFlag bool, linesFlag bool, wordsFlag bool, charsEnabled bool, allFlagsDisabled bool) {
	s := ""

	if linesFlag || allFlagsDisabled {
		s += fmt.Sprintf("%8d", f.lines)
	}
	if wordsFlag || allFlagsDisabled {
		s += fmt.Sprintf("%8d", f.words)
	}
	if charsEnabled {
		s += fmt.Sprintf("%8d", f.chars)
	}
	if bytesFlag || allFlagsDisabled {
		s += fmt.Sprintf("%8d", f.bytes)
	}

	// Println rather than Printf, so a % in a filename is not
	// treated as a format verb
	fmt.Println(s + " " + f.filename)
}

var rootCmd = &cobra.Command{
	Use:   "wc",
	Short: "word, line, character, and byte count",
	Long:  `A clone of the wc command in Unix. Do "man wc" for more information.`,
	RunE: func(cmd *cobra.Command, files []string) error {
		bytesFlag, _ := cmd.Flags().GetBool("bytes")
		linesFlag, _ := cmd.Flags().GetBool("lines")
		wordsFlag, _ := cmd.Flags().GetBool("words")
		charsFlag, _ := cmd.Flags().GetBool("chars")

		// @note: I'm varying from official wc behavior here.
		//        They will take the last of -c and -m if both are used.
		//        I'm simply going to use -c if both are used.
		//        Cobra does not have a simple way of getting the order.
		//        Also, I really dislike -cm and -mc giving different behavior.
		charsEnabled := charsFlag && !bytesFlag
		allFlagsDisabled := !bytesFlag && !linesFlag && !wordsFlag && !charsFlag

		totalLines := 0
		totalWords := 0
		totalChars := 0
		totalBytes := 0

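		if len(files) == 0 {
			reader := bufio.NewReader(os.Stdin)
			fileParseResult, err := getCounts(reader, "")
			check(err)

			// Print the counts for standard input, just like wc does
			fileParseResult.Println(bytesFlag, linesFlag, wordsFlag, charsEnabled, allFlagsDisabled)

			totalLines += fileParseResult.lines
			totalWords += fileParseResult.words
			totalChars += fileParseResult.chars
			totalBytes += fileParseResult.bytes
		}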

		for _, file := range files {
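			fileReader, err := os.Open(file)
			check(err)

			fileParseResult, err := getCounts(fileReader, file)
			// Close right away instead of deferring: a defer inside the
			// loop would keep every file open until RunE returns
			fileReader.Close()
			check(err)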

			fileParseResult.Println(bytesFlag, linesFlag, wordsFlag, charsEnabled, allFlagsDisabled)

			totalLines += fileParseResult.lines
			totalWords += fileParseResult.words
			totalChars += fileParseResult.chars
			totalBytes += fileParseResult.bytes
		}

		if len(files) > 1 {
			totalResult := FileParseResult{
				lines:    totalLines,
				words:    totalWords,
				chars:    totalChars,
				bytes:    totalBytes,
				filename: "total",
			}

			totalResult.Println(bytesFlag, linesFlag, wordsFlag, charsEnabled, allFlagsDisabled)
		}

		return nil
	},
}

func Execute() {
	err := rootCmd.Execute()
	if err != nil {
		os.Exit(1)
	}
}

func init() {
	rootCmd.Flags().BoolP("bytes", "c", false, "The number of bytes in each input file is written to the standard output.  This will cancel out any prior usage of the -m option.")

	rootCmd.Flags().BoolP("lines", "l", false, "The number of lines in each input file is written to the standard output.")

	rootCmd.Flags().BoolP("words", "w", false, "The number of words in each input file is written to the standard output.")

	rootCmd.Flags().BoolP("chars", "m", false, "The number of characters in each input file is written to the standard output.  If the current locale does not support multibyte characters, this is equivalent to the -c option.  This will cancel out any prior usage of the -c option.")
}

There is, however, a problem. What if I do this?

# I have a large file, test.txt
mkdir temp && \
	cd temp && \
	for i in {1..100000}; do cp ../test.txt test$i.txt; done

# then, back in the project directory...
cd .. && \
	./ccwc temp/*.txt

Oh no!

Note: you may be surprised that our program supports wildcards out of the box. In fact, the shell expands the pattern for us and passes the resulting list of filenames to our program as arguments.

Handling Many Many Files

Now we need to add concurrency. We could run every call to getCounts in its own goroutine and start them all immediately, but then we'd have n goroutines and n open file handles at once (where n is the number of files).

I'm going to add a semaphore, just like I show here.
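
The utils package isn't listed in this post. As a rough sketch, assuming it follows the common buffered-channel pattern, a semaphore with the NewSemaphore, Acquire, and Release names used below could look like this. The real utils package may differ, and maxConcurrent is a made-up helper that only exists to demonstrate the limit.

```go
package main

import (
	"fmt"
	"sync"
)

// Semaphore limits how many goroutines may hold a slot at once.
// Assumption: a buffered channel whose capacity is the limit.
type Semaphore struct {
	slots chan struct{}
}

func NewSemaphore(n int) *Semaphore {
	return &Semaphore{slots: make(chan struct{}, n)}
}

// Acquire blocks while all n slots are taken.
func (s *Semaphore) Acquire() { s.slots <- struct{}{} }

// Release frees one slot.
func (s *Semaphore) Release() { <-s.slots }

// maxConcurrent starts `tasks` goroutines guarded by a semaphore of size
// `limit` and returns the highest number that were running at once.
func maxConcurrent(tasks, limit int) int {
	sem := NewSemaphore(limit)
	var wg sync.WaitGroup
	var mu sync.Mutex
	running, peak := 0, 0

	for i := 0; i < tasks; i++ {
		sem.Acquire()
		wg.Add(1)
		go func() {
			defer sem.Release()
			defer wg.Done()

			mu.Lock()
			running++
			if running > peak {
				peak = running
			}
			mu.Unlock()

			mu.Lock()
			running--
			mu.Unlock()
		}()
	}

	wg.Wait()
	return peak
}

func main() {
	fmt.Println(maxConcurrent(1000, 50) <= 50)
	// prints "true"
}
```

Because Acquire is called before each goroutine starts, the loop itself blocks once the limit is reached, so no more than the limit of goroutines (and, in our case, open files) exist at any moment.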

root.go
package cmd

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strings"
	"sync"
	"unicode/utf8"

	"wc/utils"

	"github.com/spf13/cobra"
)

// ...

var rootCmd = &cobra.Command{
	Use:   "wc",
	Short: "word, line, character, and byte count",
	Long:  `A clone of the wc command in Unix. Do "man wc" for more information.`,
	RunE: func(cmd *cobra.Command, files []string) error {
		// ...

		semaphore := utils.NewSemaphore(50)
		wg := sync.WaitGroup{}
		totals := make(chan FileParseResult)

		totalLines := 0
		totalWords := 0
		totalChars := 0
		totalBytes := 0

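		// Add up the results in a single goroutine, so the totals are
		// only ever written from one place. done is closed once the
		// channel has been drained.
		done := make(chan struct{})
		go func() {
			defer close(done)
			for fileParseResult := range totals {
				totalLines += fileParseResult.lines
				totalWords += fileParseResult.words
				totalChars += fileParseResult.chars
				totalBytes += fileParseResult.bytes
			}
		}()

		if len(files) == 0 {
			semaphore.Acquire()
			wg.Add(1)
			go func() {
				defer semaphore.Release()
				defer wg.Done()

				reader := bufio.NewReader(os.Stdin)
				fileParseResult, err := getCounts(reader, "")
				check(err)

				// Print the counts for standard input, just like wc does
				fileParseResult.Println(bytesFlag, linesFlag, wordsFlag, charsEnabled, allFlagsDisabled)

				totals <- fileParseResult
			}()
		}

		for _, file := range files {
			semaphore.Acquire()
			wg.Add(1)
			go func(file string) {
				defer semaphore.Release()
				defer wg.Done()

				fileReader, err := os.Open(file)
				check(err)
				// Close the handle when this goroutine finishes, so the
				// semaphore really does cap the number of open files
				defer fileReader.Close()

				fileParseResult, err := getCounts(fileReader, file)
				check(err)

				fileParseResult.Println(bytesFlag, linesFlag, wordsFlag, charsEnabled, allFlagsDisabled)

				totals <- fileParseResult
			}(file)
		}

		wg.Wait()
		close(totals)
		// Wait for the last result to be added before reading the totals
		<-done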

		if len(files) > 1 {
			totalResult := FileParseResult{
				lines:    totalLines,
				words:    totalWords,
				chars:    totalChars,
				bytes:    totalBytes,
				filename: "total",
			}

			totalResult.Println(bytesFlag, linesFlag, wordsFlag, charsEnabled, allFlagsDisabled)
		}

		return nil
	},
}

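Now we can run this thing on as many files as we want, and it will be fast while keeping a reasonable memory profile. One trade-off: each goroutine prints its result as soon as it finishes, so the per-file lines may not appear in the same order as the arguments.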

Here is the GitHub repo with all of the code.