Friday, March 28, 2014

Editing a PDF file with Python (with a little help from PDFTKBuilder)

I'm working with a report published with SQL Server Reporting Services (SSRS).  The report is located on a remote server in Africa.  It is inconvenient for management in North America to view the report and print it from a browser (slow connection, formatting issues).  Instead, management would like a PDF file of the report to be e-mailed out to a distribution list.

This post deals with taking the PDF dumped from the SSRS web report and cleaning it up for viewing and navigation (bookmarking).  I didn't know a great deal about PDF's before working on this.  My ignorance will probably be reflected in the terminology I use and my approach.  Nonetheless, the problem was a bit more involved than I anticipated.  My intent is to put my experience out there and, if I have made things harder than necessary, get some feedback in the comments.

I think it's fair to say that SSRS is not a mature product yet, but, in a Microsoft/Windows environment its usefulness trumps that.  The dump to PDF or Excel feature for reports is handy, but doesn't always yield an output format consistent with the SSRS web report.  The first problem I had was a "corrupt" PDF dump.  The file opens fine in Acrobat Reader, but doesn't behave well when one attempts to copy its contents with modifications to another file with PyPDF2 (this is just a straight copy of pages from one pdf file to another new one):

Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import PyPDF2 as pdf
>>> dumpfile = open('baddumpfromssrs.pdf', 'rb')
>>> reader = pdf.PdfFileReader(dumpfile)
>>> numpages = reader.getNumPages()
>>> numpages
54
>>> outputfile = open('testoutput.pdf', 'wb')
>>> writer = pdf.PdfFileWriter()
>>> for x in xrange(numpages):
...     writer.addPage(reader.getPage(x))
...
>>> writer.write(outputfile)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 279, in write
    self._sweepIndirectReferences(externalReferenceMap, self._root)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 367, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 343, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 367, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 343, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 352, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, data[i])
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 367, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 343, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 343, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 343, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 381, in _sweepIndirectReferences
    newobj = self._sweepIndirectReferences(externMap, newobj)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 343, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 372, in _sweepIndirectReferences
    newobj = data.pdf.getObject(data)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 1164, in getObject
    retval = readObject(self.stream, self)
  File "c:\python27\lib\site-packages\PyPDF2\generic.py", line 71, in readObject
    return DictionaryObject.readFromStream(stream, pdf)
  File "c:\python27\lib\site-packages\PyPDF2\generic.py", line 587, in readFromStream
    value = readObject(stream, pdf)
  File "c:\python27\lib\site-packages\PyPDF2\generic.py", line 91, in readObject
    return NumberObject.readFromStream(stream)
  File "c:\python27\lib\site-packages\PyPDF2\generic.py", line 257, in readFromStream
    return NumberObject(num)
ValueError: invalid literal for int() with base 10: ''
>>>


Bummer.  I had to google around until something showed up on Stackoverflow.  There's a comment on the post that suggests the use of pdftk to "un-corrupt" the file.  To avoid having to have an admin install something on my work computer, I downloaded PDFTKBuilder Portable.  This is really overkill, because I didn't need the user interface to clean up the file.  There is an App folder in the pdftk portable install that has the pdftk command line tool:

C:\blogdoc>pdftk baddumpfromssrs.pdf output good.pdf

This "worked" in terms of preparing the file to be dealt with with Python and PyPDF2, but not before I opened it in Adobe Reader and closed it.  I'm working in a corporate environment under Windows 7.  I double checked to see that I had the same command in the command window (I used the up arrow key to recall it to issue the command that worked).  I don't know what's going on there.  The important thing was that I could proceed with the non-corrupt file good.pdf.

While I'm on things that weren't working, I should probably mention the Python 3/Python 2 thing.  This originally gave an error on Python 3; when I tried to reproduce the problem, it hung forever and I had to kill it with Ctrl-C:

Python 3.4.0 (v3.4.0:04f714765c13, Mar 16 2014, 19:24:06) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import PyPDF2 as pdf
>>> inputfile = open('good.pdf', 'rb')
>>> reader = pdf.PdfFileReader(inputfile)
>>> pagex = reader.getPage(0)
>>> pagex.extractText()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python34\lib\site-packages\PyPDF2\pdf.py", line 2070, in extractText
    content = ContentStream(content, self.pdf)
  File "c:\python34\lib\site-packages\PyPDF2\pdf.py", line 2153, in __init__
    self.__parseContentStream(stream)
  File "c:\python34\lib\site-packages\PyPDF2\pdf.py", line 2173, in __parseContentStream
    operator += tok
KeyboardInterrupt


I'm a bit of a Python 3 advocate, sometimes even a zealot.  Still, pain won over my conviction and I switched to Python 2.7 where I got better results.  There is mention of this problem (with error) on StackOverflow.  A comment makes mention of replacing a couple PyPDF2 source files to make sure it runs with Python 3.  I couldn't find the link and took the expedient Python 2.7 route.

This is about where everything started to work the way it was supposed to.  Now I could get down to fixing the SSRS pdf report dump.  The first thing that needed to happen was the removal of a bunch of blank pages from the report.  On the SSRS web report they weren't there, but the PDF file had a blank page everywhere there was a page break.  Conveniently, this was every other page:

Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import PyPDF2 as pdf
>>> inputfile = open('good.pdf', 'rb')
>>> reader = pdf.PdfFileReader(inputfile)
>>> numpages = reader.getNumPages()
>>> numpages
54
>>> contentpages = (x for x in xrange(numpages) if x % 2 == 0)
>>> writer = pdf.PdfFileWriter()
>>> for n in contentpages:
...     pagex = reader.getPage(n)
...     writer.addPage(pagex)
...
>>> writer.write(outputfile)
>>> outputfile.close()
>>> inputfile.close()


Super!  Now I've got a 27 page document with content on every page.  The next thing I needed in my case was a banner or mark across each page saying, "DRAFT FORMAT."  The specific idea was that this was a sample report being circulated for comments and approval.

I didn't want super bold red text across the page, rather white text outlined in red.  Some googling paid off with a suggestion from a mailing list.  reportlab.pdfgen is the tool used for creating the file with the banner.  We'll merge it to the pages of the main report document later.

Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from reportlab.pdfgen import canvas as canx
>>> c = canx.Canvas('banner.pdf')
>>> # want red outline
...
>>> c.setStrokeColor((1, 0, 0))
>>> # inside of letters should be white
...
>>> c.setFillColor((1, 1, 1))
>>> c.setLineWidth(1.0)
>>> t = c.beginText()
>>> t.setTextRenderMode(2)
>>> c._code.append(t.getCode())
>>> c.setFont('Helvetica', 48)
>>> # origin is at bottom, left of page
...
>>> c.drawString(2 * 72, 7 * 72, 'DRAFT FORMAT')
>>> c.save()
>>>
>>>


Great, I've got a banner.


There is a whole bunch of stuff in that code segment that I'm leaving unexplained.  Not a big surprise, but to use a Python API to edit PDF's, you need to know something about the format.  This has been a huge learning experience over the course of a day or two for me.  What helped me most is the reportlab documentation.  After copying a code snippet and seeing that it worked, I could go back there and try to figure out how it works.  This learning experience is a work in progress.  There are things you pick up right away, though.  For instance, Adobe Reader comes with 14 base fonts of which Hevletica is one.  Who knew?  Not I!

My banner isn't quite the way I want it.  It's horizontal and I would like to tilt it to 45 degrees.  Google again to the rescue.  Some kind soul has already covered it on a blog.

Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from reportlab.pdfgen import canvas as canx
>>> c = canx.Canvas('banner.pdf')
>>> c.setStrokeColor((1, 0, 0))
>>> c.setFillColor((1, 1, 1))
>>> c.setLineWidth(1.0)
>>> t = c.beginText()
>>> t.setTextRenderMode(2)
>>> c._code.append(t.getCode())
>>> c.setFont('Helvetica', 48)
>>> c.saveState()
>>> c.translate(100, 100)
>>> c.rotate(45)
>>> c.drawCentredString(500, 100, 'DRAFT FORMAT')
>>> c.save()
>>>


Close enough.  Confession - I don't really think things through and measure with trigonometry what it will take to get placement right; I just "hack" until it looks about right.  This is a habit I should break if I continue to have to play with pdf's.





 
 
Now we'll merge the banner with some pages from another pdf to make a new document.  I'm going to use pages from the reportlab documentation because there's all kinds of work stuff in the pdf I generated above.

Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import PyPDF2 as pdf
>>> bannerfile = open('banner.pdf', 'rb')
>>> docfile = open('docfile.pdf', 'rb')
>>> outputfile = open('newfile.pdf', 'wb')
>>> readerbanner = pdf.PdfFileReader(bannerfile)
>>> readerdoc = pdf.PdfFileReader(docfile)
>>> writernewdoc = pdf.PdfFileWriter()
>>> pagesdoc = (readerdoc.getPage(x) for x in xrange(286, 291))
>>> for pagen in pagesdoc:
...     writernewdoc.addPage(pagen)
...
>>> writernewdoc.write(outputfile)
>>> outputfile.close()
>>> docfile.close()
>>> # now merge banner to pages of new file
...
>>> opaquebannerfile = open('opaquebannerfile.pdf', 'wb')
>>> testpagefile = open('newfile.pdf', 'rb')
>>> bannerpage = readerbanner.getPage(0)
>>> readertestpages = pdf.PdfFileReader(testpagefile)
>>> writeropaquebanner = pdf.PdfFileWriter()
>>> for x in xrange(readertestpages.getNumPages()):
...     pagex = readertestpages.getPage(x)
...     pagex.mergePage(bannerpage)
...     writeropaquebanner.addPage(pagex)
...
>>> writeropaquebanner.write(opaquebannerfile)
>>> opaquebannerfile.close()
>>> bannerfile.close()
>>> testpagefile.close()
>>>


It's not perfect, but it's essentially what I wanted (my centering of the banner could be better).



What if I wanted a transparent banner to emphasize the draft nature of the content rather than that of the format?

Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from reportlab.pdfgen import canvas as canx
>>> c = canx.Canvas('transparent.pdf')
>>> c.setStrokeColor((1, 0, 0))
>>> transparentwhite = canx.Color(255, 255, 255, alpha = 0.0)
>>> c.setFillColor(transparentwhite)
>>> t = c.beginText()
>>> t.setTextRenderMode(2)
>>> c._code.append(t.getCode())
>>> c.setFont('Helvetica', 48)
>>> c.saveState()
>>> c.translate(100, 100)
>>> c.rotate(45)
>>> c.drawCentredString(500, 100, 'DRAFT')
>>> c.save()
>>>
>>> # merge again
...
>>> transparentbannerfile = open('transparent.pdf', 'rb')
>>> testpagefile = open('newfile.pdf', 'rb')
>>> outputfile = open('mergedtransparent.pdf', 'wb')
>>> import PyPDF2 as pdf
>>> readerbanner = pdf.PdfFileReader(transparentbannerfile)
>>> readertestpages = pdf.PdfFileReader(testpagefile)
>>> bannerpage = readerbanner.getPage(0)
>>> writeroutput = pdf.PdfFileWriter()
>>> for x in xrange(readertestpages.getNumPages()):
...     pagex = readertestpages.getPage(x)
...     pagex.mergePage(bannerpage)
...     writeroutput.addPage(pagex)
...
>>> writeroutput.write(outputfile)
>>> outputfile.close()
>>> transparentbannerfile.close()
>>> testpagefile.close()
>>>


Not beautiful (I would make the banner font edge thinner), but it is indeed transparent.

 

The transparency part is the alpha value in the code for the color transparentwhite.  There is some sample code that shows how to do this on reportlab.com's site.

The last thing I needed to deal with was bookmarks.  I had some problems initially in that, although the bookmark showed up, it ended up at the bottom of the page underneath the SSRS tables and charts I was trying to reference.  I got around this by digging into the dictionary structure of the PyPDF2 Bookmark object.  Here is the (one line) function code:

OBJECTKEY = '/A'
LISTKEY = '/D'


FULLPAGE = '/Fit'

def fixbookmark(bookmark):
    """
    bookmark is a PyPDF2 bookmark object.

    Side effect function that changes bookmark
    page display mode to full page.
    """
    # getObject yields a dictionary
    props = bookmark.getObject()[OBJECTKEY][LISTKEY][1] = pdf.generic.NameObject(FULLPAGE)
    return 0


addBookmark is a method of the PyPDF2.PdfFileWriter object.  It takes a string name, a page index (zero based), and an optional parent PyPDF2 Bookmark object.  The references in my fixbookmark function "take" prior to writing the pdf to disk with the write method of the PyPDF2.PdfFileWriter object.

Mike Driscoll blogged about PyPDF2 a couple years back.  He's got a whole series on PDF's in fact (aside: the man is a pragmatic programming blogging machine).  There are good code snippets and pretty good comment threads on those posts for newbs like me.  I found that rl118.pdf doc useful for familiarizing myself with the pdf file format and constants used to reference objects within the file format.

This was a bit of an experience dump on my part.  If you've read this far, thank you for your patience and for having a look.



No comments:

Post a Comment