pyright: PDF - Removing Pages and Inserting Nested Bookmarks

I blogged before about PyPDF2 and some initial work I had done in response to a request to get a report from Microsoft SQL Server Reporting Services (SSRS) into PDF format. Since then I've had better luck using PyPDF2 with Python 3.4. I seldom need to make any adjustments to either the PDF file or my Python code to get things to work.

Presented below is the code that is working for me now. The basic gist of it is to strip the blank pages (conveniently SSRS dumps the report with a blank page every other page) from the SSRS PDF dump and reinsert the bookmarks in the right places in a new final document. The report I'm doing is about 30 pages, so having bookmarks is pretty critical for presentation and usability.
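
Before the full listing, here is a minimal sketch of the page arithmetic on its own, with no PDF library involved. The function names keptpages and newindex are made up for this illustration; they mirror what getpages() and matchpage() do in the listing, assuming the dump alternates content page, blank page, starting with a content page at index 0.

```python
def keptpages(numpages):
    """Pair each kept 0-based index in the dump with its index in the new document."""
    return [(old, old // 2) for old in range(0, numpages, 2)]


def newindex(oldindex):
    """Map any 0-based index in the dump to an index in the new document."""
    # Same arithmetic as math.floor((oldindex + 1) / 2) in the listing.
    return (oldindex + 1) // 2


# A six-page dump keeps indices 0, 2 and 4.
for old, new in keptpages(6):
    print('old page index {0:d} -> new page index {1:d}'.format(old, new))
```

With this arithmetic a bookmark that points at a content page (even index) lands on that same page in the new document, and one that happens to point at a blank page (odd index) lands on the content page that follows it.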

The approach I took was to get the bookmarks out of the PDF object model and into a nested dictionary that I could understand and work with easily. To keep the bookmarks in the right order for presentation I used collections.OrderedDict instead of just a regular Python dictionary structure. The code should work for any depth level of nested parent-child PDF bookmarks. My report only goes three or four levels deep, but things can get fairly complex even at that level.
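
To make the shape of that structure concrete, here is a small sketch that builds one by hand and walks it recursively. The bookmark titles, the node() helper and the walk() function are invented for the illustration; only the 'name' and 'children' keys and the page-number keys match the listing.

```python
from collections import OrderedDict


def node(name, children=None):
    """One bookmark entry: a title plus an ordered dictionary of children."""
    return {'name': name, 'children': children or OrderedDict()}


# Keys are page numbers in the new document (made-up values).
bookmarks = OrderedDict()
bookmarks[0] = node('Summary')
bookmarks[1] = node('Method A', OrderedDict([
    (1, node('Assumptions')),
    (2, node('Results')),
]))


def walk(bookmarkdict, depth=0):
    """Yield (depth, page, name) for every bookmark in presentation order."""
    for page, entry in bookmarkdict.items():
        yield depth, page, entry['name']
        yield from walk(entry['children'], depth + 1)


for depth, page, name in walk(bookmarks):
    print(4 * ' ' * depth + '{0} (page {1:d})'.format(name, page))
```

Because the keys are page numbers, this layout holds only one bookmark per page at each level; two sibling bookmarks pointing at the same page would overwrite each other.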

There are a couple of artifacts of the actual report I'm doing: the name "comparisonreader" refers to the subject of the report, a comparison of accounting methods' results. I've tried to sanitize the code where appropriate, but I may have missed a thing or two.

It may be a bit overwrought (too much code), but it gets the job done. Thanks for having a look.

#!C:\python34\python

"""
Strip out blank pages and keep bookmarks for
SQL Server SSRS dump of model comparison report (pdf).
"""

import PyPDF2 as pdf
import math
from collections import OrderedDict

INPUTFILE = 'SSRSdump.pdf'
OUTPUTFILE = 'Finalreport.pdf'
OBJECTKEY = '/A'
LISTKEY = '/D'

# Adobe PDF document element keys.
FULLPAGE = '/Fit'
PAGE = '/Page'
PAGES = '/Pages'
ROOT = '/Root'
KIDS = '/Kids'
TITLE = '/Title'

# Python/PDF library types.
NODE = pdf.generic.Destination
CHILD = list

ADDPAGE = 'Adding page {0:d} from SSRS dump to page {1:d} of new document . . .'

# dictionary keys
NAME = 'name'
CHILDREN = 'children'

INDENT = 4 * ' '

ADDEDBOOKMARK = 'Added bookmark {0:s} to parent bookmark {1:s} at depthlevel {2:d}.'
TOPLEVEL = 'TOPLEVEL'

def getpages(comparisonreader):
    """
    From a PDF reader object, gets the
    page numbers of the odd numbered pages
    in the old document (SSRS dump) and
    the corresponding page in the final
    document.
    Returns a generator of two-tuples:
    (index in old document, index in new document).
    """
    # get number of pages then get odd numbered pages
    # (even numbered indices)
    numpages = comparisonreader.getNumPages()
    return ((x, x // 2) for x in range(0, numpages, 2))

def fixbookmark(bookmark):
    """
    bookmark is a PyPDF2 bookmark object.
    Side effect function that changes bookmark
    page display mode to full page.
    """
    # getObject yields a dictionary; item 1 of the /D destination
    # list is the display (fit) mode.
    bookmark.getObject()[OBJECTKEY][LISTKEY][1] = pdf.generic.NameObject(FULLPAGE)

def matchpage(page, pages):
    """
    Find index of page match.
    page is a PyPDF2 page object.
    pages is the list (PyPDF2 array) of page objects.
    Returns integer page index in new (smaller) doc.
    """
    originalpageidx = pages.index(page)
    return math.floor((originalpageidx + 1)/2)

def pagedict(bookmark, pages):
    """
    Creates page dictionary for PyPDF2 bookmark object.
    bookmark is a PDF object (dictionary).
    pages is a list of PDF page objects (dictionary).
    Returns a two-tuple of a dictionary and
    an integer page number.
    """
    page = matchpage(bookmark[PAGE].getObject(), pages)
    title = bookmark[TITLE]
    # One bookmark per page per level.
    lookupdict = OrderedDict()
    lookupdict.update({page:{NAME:title,
                             CHILDREN:OrderedDict()}})
    return lookupdict, page

def recursivepopulater(bookmark, pages):
    """
    Fills in child nodes of bookmarks
    recursively and returns dictionary.
    """
    dictx = OrderedDict()
    for pagex in bookmark:
        if type(pagex) is NODE:
            # get page info and update dictionary with it
            lookupdict, page = pagedict(pagex, pages)
            dictx.update(lookupdict)
        elif type(pagex) is CHILD:
            # a list holds the children of the preceding bookmark
            newdict = OrderedDict()
            newdict.update(recursivepopulater(pagex, pages))
            dictx[page][CHILDREN].update(newdict)
    return dictx

def makenewbookmarks(pages, bookmarks):
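    """
    Main function to generate bookmark dictionary:
    {page number: {name: <name>,
                   children: {<more bookmarks>}},
     and so on}.
    pages is the list of page objects from the
    original document, in order.
    bookmarks is the nested list returned by the
    reader's getOutlines().
    Returns an OrderedDict.
    """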
    dictx = OrderedDict()
    # top level bookmarks
    # it's going to go bookmark, list, bookmark, list, etc.
    for bookmark in bookmarks:
        if type(bookmark) is NODE:
            # get page info and update dictionary with it
            lookupdict, page = pagedict(bookmark, pages)
            dictx.update(lookupdict)
        elif type(bookmark) is CHILD:
            dictx[page][CHILDREN] = recursivepopulater(bookmark, pages)
    return dictx

def printbookmarkaddition(name, parentname, depthlevel):
    """
    Print notification of bookmark addition.
    Indentation based on integer depthlevel.
    name is the string name of the bookmark.
    parentname is the string name of the parent
    bookmark.
    Side effect function.
    """
    args = name, parentname, depthlevel
    indent = depthlevel * INDENT
    print(indent + ADDEDBOOKMARK.format(*args))

def dealwithbookmarks(comparisonreader, output, bookmarkdict, depthlevel, levelparent=None, parentname=None):
    """
    Fix bookmarks so that they are properly
    placed in the new document with the blank
    pages removed. Recursive side effect function.
    comparisonreader is the PDF reader object
    for the original document.

    output is the PDF writer object for the
    final document.

    bookmarkdict is a dictionary of bookmarks.

    depthlevel is the depth inside the nested
    dictionary-list structure (0 is the top).

    levelparent is the parent bookmark.

    parentname is the name of the parent bookmark.
    """
    depthlevel += 1
    for pagekeylevel in bookmarkdict:
        namelevel = bookmarkdict[pagekeylevel][NAME]
        levelparentii = output.addBookmark(namelevel, pagekeylevel, levelparent)
        if depthlevel == 0:
            parentname = TOPLEVEL
        printbookmarkaddition(namelevel, parentname, depthlevel)
        fixbookmark(levelparentii)
        # dictionary
        secondlevel = bookmarkdict[pagekeylevel][CHILDREN]
        argsx = comparisonreader, output, secondlevel, depthlevel, levelparentii, namelevel
        # Recursive call.
        dealwithbookmarks(*argsx)

def cullpages():
    """
    Fix SSRS PDF dump by removing blank
    pages.
    """
    # Context managers close both files even if something fails.
    with open(INPUTFILE, 'rb') as ssrsdump, open(OUTPUTFILE, 'wb') as finalreport:
        comparisonreader = pdf.PdfFileReader(ssrsdump)
        pageindices = getpages(comparisonreader)
        output = pdf.PdfFileWriter()
        # add pages from SSRS dump to new pdf doc
        for (old, new) in pageindices:
            print(ADDPAGE.format(old, new))
            pagex = comparisonreader.getPage(old)
            output.addPage(pagex)
        # Attempt to add bookmarks from original doc
        # getOutlines yields a list of nested dictionaries and lists:
        #    outermost list - starts with parent bookmark (dictionary)
        #        inner list - starts with child bookmark (dictionary)
        #                     and so on
        # The SSRS dump and this list have bookmarks in correct order.
        bookmarks = comparisonreader.getOutlines()
        # Get page numbers using this methodology (indirect object references)
        # http://stackoverflow.com/questions/1918420/split-a-pdf-based-on-outline
        # list of IndirectObject's of pages in order
        pages = [pagen.getObject() for pagen in
                comparisonreader.trailer[ROOT].getObject()[PAGES].getObject()[KIDS]]
        # Bookmarks.
        # Top level is list of bookmarks.
        # List goes parent bookmark (Destination object)
        #               child bookmarks (list)
        #                   and so on.
        bookmarkdict = makenewbookmarks(pages, bookmarks)
        # Initial level of -1 allows increment to 0 at start.
        dealwithbookmarks(comparisonreader, output, bookmarkdict, -1)
        print('\n\nWriting final report . . .')
        output.write(finalreport)
    print('\n\nFinished.\n\n')

if __name__ == '__main__':
    cullpages()
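
One caveat about the pages list built in cullpages(): it reads only the direct /Kids of the root /Pages node. The PDF format allows the page tree to nest intermediate /Pages nodes, so for some files that list may not contain every page. Below is a rough sketch of flattening such a tree. It uses plain dictionaries as stand-ins for PDF objects, and flattenpages and the 'label' key are made up for the illustration; with real PyPDF2 objects each kid would presumably need a getObject() call first, as in the listing.

```python
def flattenpages(node):
    """Return the leaf page dictionaries under node, in document order."""
    if node.get('/Type') == '/Pages':
        leaves = []
        for kid in node['/Kids']:
            leaves.extend(flattenpages(kid))
        return leaves
    return [node]


# Hypothetical page tree: the root holds one page and one nested /Pages node.
root = {'/Type': '/Pages', '/Kids': [
    {'/Type': '/Page', 'label': 'p0'},
    {'/Type': '/Pages', '/Kids': [
        {'/Type': '/Page', 'label': 'p1'},
        {'/Type': '/Page', 'label': 'p2'},
    ]},
]}

print([page['label'] for page in flattenpages(root)])
```

I have not needed this for the SSRS dump, where the flat /Kids list has been enough.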

3 comments:

Zipper, September 17, 2018 at 9:12 PM
Hi Carl,

Formatting issue at the imports:
import PyPDF2 as pdfimport math

should read:

import PyPDF2 as pdf
import math

Is there something missing from the pages part of this? For me the line:
pages = [pagen.getObject() for pagen in reader.trailer[ROOT].getObject()[PAGES].getObject()[KIDS]]
isn't iterating fully through all the bookmarks in my PDF. What sort of structure should I be passing on to makenewbookmarks()?
Carl Trachte, September 18, 2018 at 7:12 AM
Zipper,

Thank you for the heads up on the import typo and for visiting the blog.

It has been a while and I need to dig into the problem again with a test pdf. Also, I should better document the inputs to the makenewbookmarks() function. I will try to get to this this week and post an update. Thank you for your patience. CBT
Zipper, September 18, 2018 at 3:12 PM
No worries and thanks for the help. I'm frustrated there isn't an equal and opposite action to the reader getOutlines() function. Having got an outline, there should be a setOutlines(outline) function for the writer I think. I'm trying to use pypdf2 to crop and manipulate PDFs but in so doing, I lose my bookmark structure. Hence my visit to your page (and many others!) to get the bookmarks from outlines, into a dictionary, then added back into the new PDF.

Cheers,
Rob

Monday, September 1, 2014
