# wc clone
`wc` (word count) is a simple Unix utility to get line, word, character, or byte counts from a file.
I'm going to use Go and Cobra for this project. (You may need to install the `cobra-cli` to run this command.)
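The command itself isn't reproduced here; the standard Cobra bootstrap (an assumption on my part) is:

```shell
# Install the generator, then scaffold a new project (assumed invocation).
go install github.com/spf13/cobra-cli@latest
cobra-cli init
```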
This gives us a great starting point.
We will be working mostly in `cmd/root.go`, as `wc` only has one root command. It doesn't have subcommands like `git commit`.
## Flags
We'll use the `init` function in `cmd/root.go` to specify our flags. We're trying to mimic `man wc`.
Now we can run it with `go run main.go -l`, etc.
## Tests
Before we go any further, I'd like to set up some tests. They'll be failing, of course.
You can run these with `go test ./...`.
## Parsing The File
Let's write a function to parse the file and calculate all of the necessary values. I'm going to specify a struct as a return type for this function, so we have a clean interface to work with internally.
Now, let's write our function to parse the file and calculate each of these values.
A few notes:
- We only want to parse the file once.
- We can't use `scanner.Scan()`, because this will give an incorrect char count by removing carriage returns when we have Windows newlines, e.g. `\r\n` instead of `\n`. Personally this doesn't matter much to me, but I want our implementation to be consistent with `wc` as much as possible.
- We're going to take in an `io.Reader` instead of a filename, so we can use this function with standard input as well.
This can work with standard input like this:
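For example (assuming the finished binary falls back to standard input when no file arguments are given):

```shell
cat main.go | go run main.go
```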
Note that we're also going to need a way to match `wc`'s output format. This took some playing around, but here is what I came up with:
## Putting It All Together
There is, however, a problem. What if I do this?
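The exact command isn't shown; imagine pointing it at a directory containing thousands of files via a glob (hypothetical path):

```shell
go run main.go ~/some/huge/dir/*
```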
Oh no!
Note - you may be surprised that our program supports wildcards out of the box. Actually, the shell expands the wildcard for us and passes the resulting array of arguments to our program.
## Handling Many Many Files
Now we need to add concurrency. We could run every call to `getCounts` in its own goroutine, all at once, but then we'd have `n` goroutines and open file handles at once (where `n` is the number of files).
I'm going to add a semaphore, just like I show here.
Now we can run this thing on as many files as we want, and it will be fast while keeping the number of open file handles and the memory usage reasonable.
Here is the GitHub repo with all of the code.
Wow! You read the whole thing. People who make it this far sometimes want to receive emails when I post something new.
I also have an RSS feed.