wc clone
wc (word count) is a simple Unix utility to get line, character, word, or byte counts from a file.
# Outputs the number of lines in filename.txt
wc -l filename.txt
# Outputs the number of words in filename.txt
wc -w filename.txt
# Outputs the number of bytes in filename.txt
wc -c filename.txt
# Outputs the number of characters in filename.txt
wc -m filename.txt
# Outputs the number of lines, words, and characters in filename.txt
wc -lwm filename.txt
# Outputs the number of lines by piping the contents of filename.txt to wc
cat filename.txt | wc -l
# Outputs line, word, and byte counts for multiple files and a total line at the end
wc filename.txt anotherfile.txt
# Outputs the number of lines for all text files in the current directory, summarized
wc -l *.txt
I'm going to use Go and Cobra for this project. (You may need to install cobra-cli to run this command.)
mkdir wc && \
  cd wc && \
  go mod init wc && \
  cobra-cli init
This gives us a great starting point.
tree .
.
├── LICENSE
├── cmd
│   └── root.go
├── go.mod
├── go.sum
└── main.go
2 directories, 5 files
We will be working mostly in cmd/root.go, as wc only has one root command. It doesn't have subcommands like git commit.
Flags
We'll use the init function in cmd/root.go to specify our flags. We're trying to mimic man wc.
func init() {
	rootCmd.Flags().BoolP("bytes", "c", false, "The number of bytes in each input file is written to the standard output. This will cancel out any prior usage of the -m option.")

	rootCmd.Flags().BoolP("lines", "l", false, "The number of lines in each input file is written to the standard output.")

	rootCmd.Flags().BoolP("words", "w", false, "The number of words in each input file is written to the standard output.")

	rootCmd.Flags().BoolP("chars", "m", false, "The number of characters in each input file is written to the standard output. If the current locale does not support multibyte characters, this is equivalent to the -c option. This will cancel out any prior usage of the -c option.")
}
Now we can run it with go run main.go -l, etc.
Tests
Before we go any further, I'd like to set up some tests. They'll be failing, of course.
#!/bin/bash

# Build the ccwc binary
go build -o ccwc

# Compare counts and exit with code 1 if any difference is found
echo "Running: diff <(wc test.txt) <(./ccwc test.txt)" && diff <(wc test.txt) <(./ccwc test.txt) && \
echo "Running: diff <(wc -l test.txt) <(./ccwc -l test.txt)" && diff <(wc -l test.txt) <(./ccwc -l test.txt) && \
echo "Running: diff <(wc -w test.txt) <(./ccwc -w test.txt)" && diff <(wc -w test.txt) <(./ccwc -w test.txt) && \
echo "Running: diff <(wc -c test.txt) <(./ccwc -c test.txt)" && diff <(wc -c test.txt) <(./ccwc -c test.txt) && \
echo "Running: diff <(wc -m test.txt) <(./ccwc -m test.txt)" && diff <(wc -m test.txt) <(./ccwc -m test.txt) && \
echo "Running: diff <(wc -l -w test.txt) <(./ccwc -l -w test.txt)" && diff <(wc -l -w test.txt) <(./ccwc -l -w test.txt) && \
echo "Running: diff <(wc -l -c test.txt) <(./ccwc -l -c test.txt)" && diff <(wc -l -c test.txt) <(./ccwc -l -c test.txt) && \
echo "Running: diff <(wc -w -c test.txt) <(./ccwc -w -c test.txt)" && diff <(wc -w -c test.txt) <(./ccwc -w -c test.txt) && \
# This is the case I'm choosing to differ from wc on.
# echo "Running: diff <(wc -cm test.txt) <(./ccwc -cm test.txt)" && diff <(wc -cm test.txt) <(./ccwc -cm test.txt) && \
echo "Running: diff <(wc -mc test.txt) <(./ccwc -mc test.txt)" && diff <(wc -mc test.txt) <(./ccwc -mc test.txt) && \
echo "Running: diff <(wc -l -w -c test.txt) <(./ccwc -l -w -c test.txt)" && diff <(wc -l -w -c test.txt) <(./ccwc -l -w -c test.txt) || \
exit 1

echo "All tests passed!"

# Clean up the ccwc binary
rm ccwc
You can run these with:
chmod +x ./test.sh
./test.sh
Parsing The File
Let's write a function to parse the file and calculate all of the necessary values. I'm going to specify a struct as a return type for this function, so we have a clean interface to work with internally.
type FileParseResult struct {
	filename string
	lines    int
	words    int
	chars    int
	bytes    int
}
Now, let's write our function to parse the file and calculate each of these values.
A few notes:
- We only want to parse the file once.
- We can't use scanner.Scan(), because it strips line endings. We would no longer know whether a line ended in \n or in a Windows-style \r\n, so the char count would come out wrong. Personally this doesn't matter much to me, but I want our implementation to be as consistent with wc as possible.
- We're going to take in an io.Reader instead of a filename, so we can also use this function with standard input.
func getCounts(rd io.Reader, name string) (FileParseResult, error) {
	// @note: cannot use scanner because new line characters
	// are stripped, and \n vs. \r\n affects the char count
	reader := bufio.NewReader(rd)
	lines := 0
	words := 0
	chars := 0
	bytes := 0

	for {
		// ReadString keeps the delimiter, so line endings are counted.
		line, err := reader.ReadString('\n')
		if err != nil && err != io.EOF {
			return FileParseResult{}, err
		}

		// @note: without this check we would count an extra line
		// if the file ends with a newline
		if err == io.EOF && len(line) == 0 {
			break
		}

		lines++
		words += len(strings.Fields(line))
		chars += utf8.RuneCountInString(line)
		bytes += len(line)

		if err == io.EOF {
			break
		}
	}

	return FileParseResult{
		lines:    lines,
		words:    words,
		chars:    chars,
		bytes:    bytes,
		filename: name,
	}, nil
}
This can work with standard input like this:
reader := bufio.NewReader(os.Stdin)
fileParseResult, err := getCounts(reader, "")
Note that we're also going to need a way to match wc's output format. This took some playing around with, but here is what I came up with:
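func (f FileParseResult) Println(bytesFlag bool, linesFlag bool, wordsFlag bool, charsEnabled bool, allFlagsDisabled bool) {
	s := ""

	// Each count is right-aligned in an 8-character column, like wc.
	if linesFlag || allFlagsDisabled {
		s += fmt.Sprintf("%8d", f.lines)
	}
	if wordsFlag || allFlagsDisabled {
		s += fmt.Sprintf("%8d", f.words)
	}
	if charsEnabled {
		s += fmt.Sprintf("%8d", f.chars)
	}
	if bytesFlag || allFlagsDisabled {
		s += fmt.Sprintf("%8d", f.bytes)
	}

	// Skip the filename (and the space before it) for standard input.
	if f.filename != "" {
		s += " " + f.filename
	}

	// Println instead of Printf, so a "%" in a filename is not
	// interpreted as a format verb.
	fmt.Println(s)
}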
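As a quick illustration of the column format, here is a standalone sketch. The helper formatCounts and the counts passed to it are made up for the example. It assumes, as wc does, that the filename is left off when the input is standard input.

```go
package main

import (
	"fmt"
	"strings"
)

// formatCounts is a hypothetical helper (not part of the project code).
// It right-aligns each count in an 8-character column and appends the
// filename, if there is one.
func formatCounts(filename string, counts ...int) string {
	var sb strings.Builder
	for _, c := range counts {
		fmt.Fprintf(&sb, "%8d", c)
	}
	if filename != "" {
		sb.WriteString(" " + filename)
	}
	return sb.String()
}

func main() {
	// Made-up counts for illustration: 7 lines, 58 words, 342 bytes.
	fmt.Printf("%q\n", formatCounts("test.txt", 7, 58, 342))
	// Standard input has no filename, so only the count is printed.
	fmt.Printf("%q\n", formatCounts("", 7))
}
```

Printing with %q makes the leading padding visible: the first line comes out as "       7      58     342 test.txt" and the second as "       7".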
Putting It All Together
package cmd

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strings"
	"unicode/utf8"

	"github.com/spf13/cobra"
)

func check(e error) {
	if e != nil {
		panic(e)
	}
}

func getCounts(rd io.Reader, name string) (FileParseResult, error) {
	// @note: cannot use scanner because new line characters
	// are stripped, and \n vs. \r\n affects the char count
	reader := bufio.NewReader(rd)
	lines := 0
	words := 0
	chars := 0
	bytes := 0

	for {
		line, err := reader.ReadString('\n')
		if err != nil && err != io.EOF {
			return FileParseResult{}, err
		}

		// @note: without this check we would count an extra line
		// if the file ends with a newline
		if err == io.EOF && len(line) == 0 {
			break
		}

		lines++
		words += len(strings.Fields(line))
		chars += utf8.RuneCountInString(line)
		bytes += len(line)

		if err == io.EOF {
			break
		}
	}

	return FileParseResult{
		lines:    lines,
		words:    words,
		chars:    chars,
		bytes:    bytes,
		filename: name,
	}, nil
}

type FileParseResult struct {
	filename string
	lines    int
	words    int
	chars    int
	bytes    int
}

func (f FileParseResult) Println(bytesFlag bool, linesFlag bool, wordsFlag bool, charsEnabled bool, allFlagsDisabled bool) {
	s := ""

	if linesFlag || allFlagsDisabled {
		s += fmt.Sprintf("%8d", f.lines)
	}
	if wordsFlag || allFlagsDisabled {
		s += fmt.Sprintf("%8d", f.words)
	}
	if charsEnabled {
		s += fmt.Sprintf("%8d", f.chars)
	}
	if bytesFlag || allFlagsDisabled {
		s += fmt.Sprintf("%8d", f.bytes)
	}

	// Skip the filename (and the space before it) for standard input.
	if f.filename != "" {
		s += " " + f.filename
	}

	// Println instead of Printf, so a "%" in a filename is not
	// interpreted as a format verb.
	fmt.Println(s)
}

var rootCmd = &cobra.Command{
	Use:   "wc",
	Short: "word, line, character, and byte count",
	Long:  `A clone of the wc command in Unix. Do "man wc" for more information.`,
	RunE: func(cmd *cobra.Command, files []string) error {
		bytesFlag, _ := cmd.Flags().GetBool("bytes")
		linesFlag, _ := cmd.Flags().GetBool("lines")
		wordsFlag, _ := cmd.Flags().GetBool("words")
		charsFlag, _ := cmd.Flags().GetBool("chars")

		// @note: I'm varying from official wc behavior here.
		// They will take the last of -c and -m if both are used.
		// I'm simply going to use -c if both are used.
		// Cobra does not have a simple way of getting the order.
		// Also, I really dislike -cm and -mc giving different behavior.
		charsEnabled := charsFlag && !bytesFlag
		allFlagsDisabled := !bytesFlag && !linesFlag && !wordsFlag && !charsFlag

		totalLines := 0
		totalWords := 0
		totalChars := 0
		totalBytes := 0

		// No file arguments: read from standard input.
		if len(files) == 0 {
			reader := bufio.NewReader(os.Stdin)
			fileParseResult, err := getCounts(reader, "")
			check(err)

			// Print the result; otherwise piped input produces no output.
			fileParseResult.Println(bytesFlag, linesFlag, wordsFlag, charsEnabled, allFlagsDisabled)

			totalLines += fileParseResult.lines
			totalWords += fileParseResult.words
			totalChars += fileParseResult.chars
			totalBytes += fileParseResult.bytes
		}

		for _, file := range files {
			fileReader, err := os.Open(file)
			check(err)

			fileParseResult, err := getCounts(fileReader, file)
			// Close right away so file handles don't pile up in the loop.
			fileReader.Close()
			check(err)

			fileParseResult.Println(bytesFlag, linesFlag, wordsFlag, charsEnabled, allFlagsDisabled)

			totalLines += fileParseResult.lines
			totalWords += fileParseResult.words
			totalChars += fileParseResult.chars
			totalBytes += fileParseResult.bytes
		}

		// Like wc, only print a total line when there are multiple files.
		if len(files) > 1 {
			totalResult := FileParseResult{
				lines:    totalLines,
				words:    totalWords,
				chars:    totalChars,
				bytes:    totalBytes,
				filename: "total",
			}

			totalResult.Println(bytesFlag, linesFlag, wordsFlag, charsEnabled, allFlagsDisabled)
		}

		return nil
	},
}

func Execute() {
	err := rootCmd.Execute()
	if err != nil {
		os.Exit(1)
	}
}

func init() {
	rootCmd.Flags().BoolP("bytes", "c", false, "The number of bytes in each input file is written to the standard output. This will cancel out any prior usage of the -m option.")

	rootCmd.Flags().BoolP("lines", "l", false, "The number of lines in each input file is written to the standard output.")

	rootCmd.Flags().BoolP("words", "w", false, "The number of words in each input file is written to the standard output.")

	rootCmd.Flags().BoolP("chars", "m", false, "The number of characters in each input file is written to the standard output. If the current locale does not support multibyte characters, this is equivalent to the -c option. This will cancel out any prior usage of the -c option.")
}
There is, however, a problem. What if I do this?
# I have a large file, test.txt
mkdir temp && \
  cd temp && \
  for i in {1..100000}; do cp ../test.txt test$i.txt; done

# then...
./ccwc temp/*.txt
Oh no!
Note: you may be surprised that our program supports wildcards out of the box. In fact, the shell expands the pattern for us and passes the resulting list of filenames to our program as arguments.
Handling Many Many Files
Now we need to add concurrency. We could run every call to getCounts in its own goroutine and start them all immediately, but then we'd have n goroutines and open file handles at once (where n is the number of files).
I'm going to add a semaphore, just like I show here.
package cmd

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strings"
	"sync"
	"unicode/utf8"

	"wc/utils"

	"github.com/spf13/cobra"
)

// ...

var rootCmd = &cobra.Command{
	Use:   "wc",
	Short: "word, line, character, and byte count",
	Long:  `A clone of the wc command in Unix. Do "man wc" for more information.`,
	RunE: func(cmd *cobra.Command, files []string) error {
		// ...

		// Allow at most 50 goroutines (and open files) at a time.
		semaphore := utils.NewSemaphore(50)
		wg := sync.WaitGroup{}
		totals := make(chan FileParseResult)
		done := make(chan struct{})

		totalLines := 0
		totalWords := 0
		totalChars := 0
		totalBytes := 0

		// A single goroutine owns the totals, so no mutex is needed.
		go func() {
			defer close(done)
			for fileParseResult := range totals {
				totalLines += fileParseResult.lines
				totalWords += fileParseResult.words
				totalChars += fileParseResult.chars
				totalBytes += fileParseResult.bytes
			}
		}()

		if len(files) == 0 {
			semaphore.Acquire()
			wg.Add(1)
			go func() {
				defer semaphore.Release()
				defer wg.Done()

				reader := bufio.NewReader(os.Stdin)
				fileParseResult, err := getCounts(reader, "")
				check(err)

				// Print the result; otherwise piped input produces no output.
				fileParseResult.Println(bytesFlag, linesFlag, wordsFlag, charsEnabled, allFlagsDisabled)

				totals <- fileParseResult
			}()
		}

		for _, file := range files {
			// Blocks here once 50 files are already being processed.
			semaphore.Acquire()
			wg.Add(1)
			go func(file string) {
				defer semaphore.Release()
				defer wg.Done()

				fileReader, err := os.Open(file)
				check(err)
				// Release the file handle when this goroutine finishes.
				defer fileReader.Close()

				fileParseResult, err := getCounts(fileReader, file)
				check(err)

				fileParseResult.Println(bytesFlag, linesFlag, wordsFlag, charsEnabled, allFlagsDisabled)

				totals <- fileParseResult
			}(file)
		}

		wg.Wait()
		close(totals)
		// Wait for the totals goroutine to finish before reading the totals.
		<-done

		if len(files) > 1 {
			totalResult := FileParseResult{
				lines:    totalLines,
				words:    totalWords,
				chars:    totalChars,
				bytes:    totalBytes,
				filename: "total",
			}

			totalResult.Println(bytesFlag, linesFlag, wordsFlag, charsEnabled, allFlagsDisabled)
		}

		return nil
	},
}
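The utils package isn't shown in this post. As a rough sketch only, a counting semaphore with the same NewSemaphore, Acquire, and Release names can be built on a buffered channel. The actual helper may be implemented differently, and maxConcurrent is a made-up function that just demonstrates the limit holds.

```go
package main

import (
	"fmt"
	"sync"
)

// Semaphore is a counting semaphore backed by a buffered channel.
// This is an assumed implementation, not the code from the utils package.
type Semaphore struct {
	slots chan struct{}
}

// NewSemaphore creates a semaphore that allows max concurrent holders.
func NewSemaphore(max int) *Semaphore {
	return &Semaphore{slots: make(chan struct{}, max)}
}

// Acquire blocks until a slot is free.
func (s *Semaphore) Acquire() { s.slots <- struct{}{} }

// Release frees a slot.
func (s *Semaphore) Release() { <-s.slots }

// maxConcurrent runs n jobs through a semaphore of the given size and
// reports the highest number of jobs observed running at the same time.
func maxConcurrent(n, limit int) int {
	sem := NewSemaphore(limit)
	var wg sync.WaitGroup
	var mu sync.Mutex
	running, peak := 0, 0

	for i := 0; i < n; i++ {
		sem.Acquire()
		wg.Add(1)
		go func() {
			defer sem.Release()
			defer wg.Done()

			mu.Lock()
			running++
			if running > peak {
				peak = running
			}
			mu.Unlock()

			// Real work (opening and counting a file) would go here.

			mu.Lock()
			running--
			mu.Unlock()
		}()
	}

	wg.Wait()
	return peak
}

func main() {
	fmt.Println("peak concurrency:", maxConcurrent(1000, 50))
}
```

Because Acquire is called before each goroutine starts and Release only runs when it returns, the number of jobs in flight never exceeds the limit, no matter how many files are queued.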
Now we can run this thing on as many files as we want, and it will be fast, but also keep a reasonable memory profile.
Here is the GitHub repo with all of the code.