February 14, 2009 by alex
On a recent Rails project we have been working a lot with PDF documents. The application generates PDF invoices and account statements, places markers and comments into existing documents and also merges multiple single page PDFs into one. We do all this with a few PDF tools:
On OSX all of these can be installed via MacPorts. Debian has packages as well.
While PDF::Writer is a Ruby library pdftk is a command line tool. We simple call it using Kernel.system
and check the return code via the $?
variable.
With all this PDF processing the need for testing the contents of the generated documents arose. We have factored out all the PDF processing into a bunch of extra classes which we simply unit test with RSpec: make sure the parameters are passed to the command line correctly, that the right exception is thrown for each return code etc.
In addition to unit testing we also write customer driven acceptance tests with Cucumber, where we assert on a high level the outcome of certain actions. With HTML pages we can simply use the built-in steps that in turn use Webrat to parse the HTML like this:
Now in our case the invoice link links to a PDF but we still want to know what’s inside the document. The solution we came up with looks like this:
What this does in the background is follow the link as usual, write the response into a temporary file, convert that to text using pdftotext and write the result back into the response. This way we can make assertions about the contents of the PDF almost as if it were an HTML page (except for tags of course). Here is the implementation: