Monday, September 1, 2014

PDF - Removing Pages and Inserting Nested Bookmarks

I blogged before about PyPDF2 and some initial work I had done in response to a request to get a report from Microsoft SQL Server Reporting Services into PDF format.  Since then I've had better luck with PyPDF2 using it with Python 3.4.  Seldom do I need to make any adjustments to either the PDF file or my Python code to get things to work.

Presented below is the code that is working for me now.  The basic gist of it is to strip the blank pages (conveniently SSRS dumps the report with a blank page every other page) from the SSRS PDF dump and reinsert the bookmarks in the right places in a new final document.  The report I'm doing is about 30 pages, so having bookmarks is pretty critical for presentation and usability.

The approach I took was to get the bookmarks out of the PDF object model and into a nested dictionary that I could understand and work with easily.  To keep the bookmarks in the right order for presentation I used collections.OrderedDict instead of just a regular Python dictionary structure.  The code should work for any depth level of nested parent-child PDF bookmarks.  My report only goes three or four levels deep, but things can get fairly complex even at that level.

There are a couple artifacts of the actual report I'm doing - the name "comparisonreader" refers to the subject of the report, a comparison of accounting methods' results.  I've tried to sanitize the code where appropriate, but missed a thing or two.

It may be a bit overwrought (too much code), but it gets the job done.  Thanks for having a look.


Strip out blank pages and keep bookmarks for
SQL Server SSRS dump of model comparison report (pdf).

import PyPDF2 as pdfimport math
from collections import OrderedDict

INPUTFILE = 'SSRSdump.pdf'
OUTPUTFILE = 'Finalreport.pdf'


# Adobe PDF document element keys.
PAGE = '/Page'
PAGES = '/Pages'
ROOT = '/Root'
KIDS = '/Kids'
TITLE = '/Title'

# Python/PDF library types.
NODE = pdf.generic.Destination
CHILD = list

ADDPAGE = 'Adding page {0:d} from SSRS dump to page {1:d} of new document . . .'

# dictionary keys
NAME = 'name'
CHILDREN = 'children'

INDENT = 4 * ' '

ADDEDBOOKMARK = 'Added bookmark {0:s} to parent bookmark {1:s} at depthlevel {2:d}.'


def getpages(comparisonreader):
    From a PDF reader object, gets the
    page numbers of the odd numbered pages
    in the old document (SSRS dump) and
    the corresponding page in the final

    Returns a generator of two tuples.
    # get number of pages then get odd numbered pages
    # (even numbered indices)
    numpages = comparisonreader.getNumPages()
    return ((x, int(x/2)) for x in range(numpages) if x % 2 == 0)

def fixbookmark(bookmark):
    bookmark is a PyPDF2 bookmark object.

    Side effect function that changes bookmark
    page display mode to full page.
    # getObject yields a dictionary
    props = bookmark.getObject()[OBJECTKEY][LISTKEY][1] = pdf.generic.NameObject(FULLPAGE)
    return 0

def matchpage(page, pages):
    Find index of page match.

    page is a PyPDF2 page object.
    pages is the list (PyPDF2 array) of page objects.
    Returns integer page index in new (smaller) doc.
    originalpageidx = pages.index(page)
    return math.floor((originalpageidx + 1)/2)

def pagedict(bookmark, pages):
    Creates page dictionary for PyPDF2 bookmark object.

    bookmark is a PDF object (dictionary).
    pages is a list of PDF page objects (dictionary).
    Returns two tuple of a dictionary and
    integer page number.
    page = matchpage(bookmark[PAGE].getObject(), pages)
    title = bookmark[TITLE]
    # One bookmark per page per level.
    lookupdict = OrderedDict()
    return lookupdict, page

def recursivepopulater(bookmark, pages):
    Fills in child nodes of bookmarks
    recursively and returns dictionary.
    dictx = OrderedDict()
    for pagex in bookmark:
        if type(pagex) is NODE:
            # get page info and update dictionary with it
            lookupdict, page = pagedict(pagex, pages)
        elif type(bookmark) is CHILD:
            newdict = OrderedDict()
            newdict.update(recursivepopulater(pagex, pages))
    return dictx

def makenewbookmarks(pages, bookmarks):
    Main function to generate bookmark dictionary:

    {page number: {name:<name>,
                   children:[<more bookmarks>]},
                   and so on.

    Returns dictionary.
    dictx = OrderedDict()
    # top level bookmarks
    # it's going to go bookmark, list, bookmark, list, etc.
    for bookmark in bookmarks:
        if type(bookmark) is NODE:
            # get page info and update dictionary with it
            lookupdict, page = pagedict(bookmark, pages)
        elif type(bookmark) is CHILD:
            dictx[page][CHILDREN] = recursivepopulater(bookmark, pages)
    return dictx

def printbookmarkaddition(name, parentname, depthlevel):
    Print notification of bookmark addition.

    Indentation based on integer depthlevel.
    name is the string name of the bookmark.
    parentname is the string name of the parent

    Side effect function.
    args = name, parentname, depthlevel
    indent = depthlevel * INDENT
    print(indent + ADDEDBOOKMARK.format(*args))

def dealwithbookmarks(comparisonreader, output, bookmarkdict, depthlevel, levelparent=None, parentname=None):
    Fix bookmarks so that they are properly
    placed in the new document with the blank
    pages removed. Recursive side effect function.

    comparisonreader is the PDF reader object
    for the original document.

    output is the PDF writer object for the
    final document.

    bookmarkdict is a dictionary of bookmarks.

    depthlevel is the depth inside the nested
    dictionary-list structure (0 is the top).

    levelparent is the parent bookmark.

    parentname is the name of the parent bookmark.
    depthlevel += 1
    for pagekeylevel in bookmarkdict:
        namelevel = bookmarkdict[pagekeylevel][NAME]
        levelparentii = output.addBookmark(namelevel, pagekeylevel, levelparent)
        if depthlevel == 0:
            parentname = TOPLEVEL
        printbookmarkaddition(namelevel, parentname, depthlevel)
        # dictionary
        secondlevel = bookmarkdict[pagekeylevel][CHILDREN]
        argsx = comparisonreader, output, secondlevel, depthlevel, levelparentii, namelevel
        # Recursive call.

def cullpages():
    Fix SSRS PDF dump by removing blank
    ssrsdump = open(INPUTFILE, 'rb')
    finalreport = open(OUTPUTFILE, 'wb')
    comparisonreader = pdf.PdfFileReader(ssrsdump)
    pageindices = getpages(comparisonreader)
    output = pdf.PdfFileWriter()
    # add pages from SSRS dump to new pdf doc
    for (old, new) in pageindices:
        print(ADDPAGE.format(old, new))
        pagex = comparisonreader.getPage(old)

    # Attempt to add bookmarks from original doc
    # getOutlines yields a list of nested dictionaries and lists:
    #    outermost list - starts with parent bookmark (dictionary)
    #        inner list - starts with child bookmark (dictionary)       
    #                     and so on
    # The SSRS dump and this list have bookmarks in correct order.
    bookmarks = comparisonreader.getOutlines()
    # Get page numbers using this methodology (indirect object references)
    # list of IndirectObject's of pages in order
    pages = [pagen.getObject() for pagen in
    # Bookmarks.
    # Top level is list of bookmarks.
    # List goes parent bookmark (Destination object)
    #               child bookmarks (list)
    #                   and so on.
    bookmarkdict = makenewbookmarks(pages, bookmarks)
    # Initial level of -1 allows increment to 0 at start.
    dealwithbookmarks(comparisonreader, output, bookmarkdict, -1)

    print('\n\nWriting final report . . .')

if __name__ == '__main__':

Sunday, August 31, 2014

Internet Explorer 9 Save Dialog - SendKeys Last Resort

At work we use Internet Explorer 9 on Windows 7 Enterprise.  SharePoint is the favored software for filesharing inside organizational groups.  Our mine planning office is in the States; the mine operation whose data I work is in a remote, poorly connected location of the world.

Recently Sharepoint was updated to a new version at the mine.  The SharePoint server configuration there no longer allows Windows Explorer view or mapping of the site to a Windows drive letter.  I've put in a trouble ticket to regain this functionality, but that may take a while if it's possible.  Without it it is difficult to automate file retrieval or get more than one file at a time.

In the meantime I've been able to get the text based files over using win32com automation in Python to run Internet Explorer and grab the innerHTML object.  innerHTML is essentially the text of the files with tags around it.  I rip out the tags, write the text to a file on my harddrive and I'm good to go.

Binary files proved to be more difficult to download.  Shown below is a screenshot of the Internet Explorer 9 dialog box that goes by the generic name Notification Bar:

I googled and could nowhere find how this thing fit into the Internet Explorer 9 Document object hierarchy.  Then I came upon this colorful exchange between Microsoft Certified MVP's from 2012 that made things a little more clear.
It turns out you can't access the Notification Bar programatically per se.  What you can do is activate the specific Internet Explorer window and tab you're interested in, then send keystrokes to get where you want to, click, and download your file.
I'm not a web programmer nor am I a dedicated Windows programmer (I'm actually a geologist).  IEC is a small module that wraps some useful functionality - in my case identifying and clicking on the link on the SharePoint page by it's text identifier:
# C Python 2.7
# Internet Explorer module.
import IEC as iec
import time
ie = iec.IEController()
ie.Navigate(<URL of SharePoint page>)
# Give the page time to load (7 seconds).
# I want to download file 11.msr.
# Give 5 seconds for the Notification Bar to show up.
I'm fortunate in that our mine planning vendor, MineSight, ships Python 2.7 and associated win32com packages along with their software (their API's are written for Python).  If you don't have win32com and friends installed, they are necessary for this solution.
At this point I've just got to deal with that pesky Internet Explorer 9 Notification Bar.  As it turns out, SendKeys makes it doable (although neither elegant nor robust :-(   ):
# Activate the SharePoint page.
from win32com.client import Dispatch as dispx
shell = dispx('WScript.Shell')
shell.AppActivate(<name of IE9 tab>)
# Little pause.
# Keyboard combination for the Notification Bar selection
# is ALT-N or '%n'
shell.SendKeys('%n', True)
# The Notification Bar goes to "Open" by default.
# You need to tab over to the "Save" button.
# Another little pause.
# Space bar clicks on this control.
shell.SendKeys(' ', True)
The key combinations for accessing the Notification Bar are in Microsoft's documentation here
One link showing use of SendKeys is a German site (mostly English text) here.
And that's pretty much it.  There's another dialog that pops up in Internet Explorer 9 after the file is downloaded.  I've been able to blow that off so far and it hasn't gotten in the way as I move to the next download.  I give these files (about 300 kb) 15 seconds to download over a slow connection.  I may have to adjust that.
This solution is an abomination by any coding/architecture/durability standard.  Still, it's the abomination that is getting the job done for the time being.
Thanks for stopping by.

Friday, March 28, 2014

Editing a PDF file with Python (with a little help from PDFTKBuilder)

I'm working with a report published with SQL Server Reporting Services (SSRS).  The report is located on a remote server in Africa.  It is inconvenient for management in North America to view the report and print it from a browser (slow connection, formatting issues).  Instead, management would like a PDF file of the report to be e-mailed out to a distribution list.

This post deals with taking the PDF dumped from the SSRS web report and cleaning it up for viewing and navigation (bookmarking).  I didn't know a great deal about PDF's before working on this.  My ignorance will probably be reflected in the terminology I use and my approach.  Nonetheless, the problem was a bit more involved than I anticipated.  My intent is to put my experience out there and, if I have made things harder than necessary, get some feedback in the comments.

I think it's fair to say that SSRS is not a mature product yet, but, in a Microsoft/Windows environment its usefulness trumps that.  The dump to PDF or Excel feature for reports is handy, but doesn't always yield an output format consistent with the SSRS web report.  The first problem I had was a "corrupt" PDF dump.  The file opens fine in Acrobat Reader, but doesn't behave well when one attempts to copy its contents with modifications to another file with PyPDF2 (this is just a straight copy of pages from one pdf file to another new one):

Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import PyPDF2 as pdf
>>> dumpfile = open('baddumpfromssrs.pdf', 'rb')
>>> reader = pdf.PdfFileReader(dumpfile)
>>> numpages = reader.getNumPages()
>>> numpages
>>> outputfile = open('testoutput.pdf', 'wb')
>>> writer = pdf.PdfFileWriter()
>>> for x in xrange(numpages):
...     writer.addPage(reader.getPage(x))
>>> writer.write(outputfile)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python27\lib\site-packages\PyPDF2\", line 279, in write
    self._sweepIndirectReferences(externalReferenceMap, self._root)
  File "c:\python27\lib\site-packages\PyPDF2\", line 367, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "c:\python27\lib\site-packages\PyPDF2\", line 343, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "c:\python27\lib\site-packages\PyPDF2\", line 367, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "c:\python27\lib\site-packages\PyPDF2\", line 343, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "c:\python27\lib\site-packages\PyPDF2\", line 352, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, data[i])
  File "c:\python27\lib\site-packages\PyPDF2\", line 367, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "c:\python27\lib\site-packages\PyPDF2\", line 343, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "c:\python27\lib\site-packages\PyPDF2\", line 343, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "c:\python27\lib\site-packages\PyPDF2\", line 343, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "c:\python27\lib\site-packages\PyPDF2\", line 381, in _sweepIndirectReferences
    newobj = self._sweepIndirectReferences(externMap, newobj)
  File "c:\python27\lib\site-packages\PyPDF2\", line 343, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "c:\python27\lib\site-packages\PyPDF2\", line 372, in _sweepIndirectReferences
    newobj = data.pdf.getObject(data)
  File "c:\python27\lib\site-packages\PyPDF2\", line 1164, in getObject
    retval = readObject(, self)
  File "c:\python27\lib\site-packages\PyPDF2\", line 71, in readObject
    return DictionaryObject.readFromStream(stream, pdf)
  File "c:\python27\lib\site-packages\PyPDF2\", line 587, in readFromStream
    value = readObject(stream, pdf)
  File "c:\python27\lib\site-packages\PyPDF2\", line 91, in readObject
    return NumberObject.readFromStream(stream)
  File "c:\python27\lib\site-packages\PyPDF2\", line 257, in readFromStream
    return NumberObject(num)
ValueError: invalid literal for int() with base 10: ''

Bummer.  I had to google around until something showed up on Stackoverflow.  There's a comment on the post that suggests the use of pdftk to "un-corrupt" the file.  To avoid having to have an admin install something on my work computer, I downloaded PDFTKBuilder Portable.  This is really overkill, because I didn't need the user interface to clean up the file.  There is an App folder in the pdftk portable install that has the pdftk command line tool:

C:\blogdoc>pdftk baddumpfromssrs.pdf output good.pdf

This "worked" in terms of preparing the file to be dealt with with Python and PyPDF2, but not before I opened it in Adobe Reader and closed it.  I'm working in a corporate environment under Windows 7.  I double checked to see that I had the same command in the command window (I used the up arrow key to recall it to issue the command that worked).  I don't know what's going on there.  The important thing was that I could proceed with the non-corrupt file good.pdf.

While I'm on things that weren't working, I should probably mention the Python 3/Python 2 thing.  This originally gave an error on Python 3; when I tried to reproduce the problem, it hung forever and I had to kill it with Ctrl-C:

Python 3.4.0 (v3.4.0:04f714765c13, Mar 16 2014, 19:24:06) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import PyPDF2 as pdf
>>> inputfile = open('good.pdf', 'rb')
>>> reader = pdf.PdfFileReader(inputfile)
>>> pagex = reader.getPage(0)
>>> pagex.extractText()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python34\lib\site-packages\PyPDF2\", line 2070, in extractText
    content = ContentStream(content, self.pdf)
  File "c:\python34\lib\site-packages\PyPDF2\", line 2153, in __init__
  File "c:\python34\lib\site-packages\PyPDF2\", line 2173, in __parseContentStream
    operator += tok

I'm a bit of a Python 3 advocate, sometimes even a zealot.  Still, pain won over my conviction and I switched to Python 2.7 where I got better results.  There is mention of this problem (with error) on StackOverflow.  A comment makes mention of replacing a couple PyPDF2 source files to make sure it runs with Python 3.  I couldn't find the link and took the expedient Python 2.7 route.

This is about where everything started to work the way it was supposed to.  Now I could get down to fixing the SSRS pdf report dump.  The first thing that needed to happen was the removal of a bunch of blank pages from the report.  On the SSRS web report they weren't there, but the PDF file had a blank page everywhere there was a page break.  Conveniently, this was every other page:

Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import PyPDF2 as pdf
>>> inputfile = open('good.pdf', 'rb')
>>> reader = pdf.PdfFileReader(inputfile)
>>> numpages = reader.getNumPages()
>>> numpages
>>> contentpages = (x for x in xrange(numpages) if x % 2 == 0)
>>> writer = pdf.PdfFileWriter()
>>> for n in contentpages:
...     pagex = reader.getPage(n)
...     writer.addPage(pagex)
>>> writer.write(outputfile)
>>> outputfile.close()
>>> inputfile.close()

Super!  Now I've got a 27 page document with content on every page.  The next thing I needed in my case was a banner or mark across each page saying, "DRAFT FORMAT."  The specific idea was that this was a sample report being circulated for comments and approval.

I didn't want super bold red text across the page, rather white text outlined in red.  Some googling paid off with a suggestion from a mailing list.  reportlab.pdfgen is the tool used for creating the file with the banner.  We'll merge it to the pages of the main report document later.

Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from reportlab.pdfgen import canvas as canx
>>> c = canx.Canvas('banner.pdf')
>>> # want red outline
>>> c.setStrokeColor((1, 0, 0))
>>> # inside of letters should be white
>>> c.setFillColor((1, 1, 1))
>>> c.setLineWidth(1.0)
>>> t = c.beginText()
>>> t.setTextRenderMode(2)
>>> c._code.append(t.getCode())
>>> c.setFont('Helvetica', 48)
>>> # origin is at bottom, left of page
>>> c.drawString(2 * 72, 7 * 72, 'DRAFT FORMAT')

Great, I've got a banner.

There is a whole bunch of stuff in that code segment that I'm leaving unexplained.  Not a big surprise, but to use a Python API to edit PDF's, you need to know something about the format.  This has been a huge learning experience over the course of a day or two for me.  What helped me most is the reportlab documentation.  After copying a code snippet and seeing that it worked, I could go back there and try to figure out how it works.  This learning experience is a work in progress.  There are things you pick up right away, though.  For instance, Adobe Reader comes with 14 base fonts of which Hevletica is one.  Who knew?  Not I!

My banner isn't quite the way I want it.  It's horizontal and I would like to tilt it to 45 degrees.  Google again to the rescue.  Some kind soul has already covered it on a blog.

Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from reportlab.pdfgen import canvas as canx
>>> c = canx.Canvas('banner.pdf')
>>> c.setStrokeColor((1, 0, 0))
>>> c.setFillColor((1, 1, 1))
>>> c.setLineWidth(1.0)
>>> t = c.beginText()
>>> t.setTextRenderMode(2)
>>> c._code.append(t.getCode())
>>> c.setFont('Helvetica', 48)
>>> c.saveState()
>>> c.translate(100, 100)
>>> c.rotate(45)
>>> c.drawCentredString(500, 100, 'DRAFT FORMAT')

Close enough.  Confession - I don't really think things through and measure with trigonometry what it will take to get placement right; I just "hack" until it looks about right.  This is a habit I should break if I continue to have to play with pdf's.

Now we'll merge the banner with some pages from another pdf to make a new document.  I'm going to use pages from the reportlab documentation because there's all kinds of work stuff in the pdf I generated above.

Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import PyPDF2 as pdf
>>> bannerfile = open('banner.pdf', 'rb')
>>> docfile = open('docfile.pdf', 'rb')
>>> outputfile = open('newfile.pdf', 'wb')
>>> readerbanner = pdf.PdfFileReader(bannerfile)
>>> readerdoc = pdf.PdfFileReader(docfile)
>>> writernewdoc = pdf.PdfFileWriter()
>>> pagesdoc = (readerdoc.getPage(x) for x in xrange(286, 291))
>>> for pagen in pagesdoc:
...     writernewdoc.addPage(pagen)
>>> writernewdoc.write(outputfile)
>>> outputfile.close()
>>> docfile.close()
>>> # now merge banner to pages of new file
>>> opaquebannerfile = open('opaquebannerfile.pdf', 'wb')
>>> testpagefile = open('newfile.pdf', 'rb')
>>> bannerpage = readerbanner.getPage(0)
>>> readertestpages = pdf.PdfFileReader(testpagefile)
>>> writeropaquebanner = pdf.PdfFileWriter()
>>> for x in xrange(readertestpages.getNumPages()):
...     pagex = readertestpages.getPage(x)
...     pagex.mergePage(bannerpage)
...     writeropaquebanner.addPage(pagex)
>>> writeropaquebanner.write(opaquebannerfile)
>>> opaquebannerfile.close()
>>> bannerfile.close()
>>> testpagefile.close()

It's not perfect, but it's essentially what I wanted (my centering of the banner could be better).

What if I wanted a transparent banner to emphasize the draft nature of the content rather than that of the format?

Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from reportlab.pdfgen import canvas as canx
>>> c = canx.Canvas('transparent.pdf')
>>> c.setStrokeColor((1, 0, 0))
>>> transparentwhite = canx.Color(255, 255, 255, alpha = 0.0)
>>> c.setFillColor(transparentwhite)
>>> t = c.beginText()
>>> t.setTextRenderMode(2)
>>> c._code.append(t.getCode())
>>> c.setFont('Helvetica', 48)
>>> c.saveState()
>>> c.translate(100, 100)
>>> c.rotate(45)
>>> c.drawCentredString(500, 100, 'DRAFT')
>>> # merge again
>>> transparentbannerfile = open('transparent.pdf', 'rb')
>>> testpagefile = open('newfile.pdf', 'rb')
>>> outputfile = open('mergedtransparent.pdf', 'wb')
>>> import PyPDF2 as pdf
>>> readerbanner = pdf.PdfFileReader(transparentbannerfile)
>>> readertestpages = pdf.PdfFileReader(testpagefile)
>>> bannerpage = readerbanner.getPage(0)
>>> writeroutput = pdf.PdfFileWriter()
>>> for x in xrange(readertestpages.getNumPages()):
...     pagex = readertestpages.getPage(x)
...     pagex.mergePage(bannerpage)
...     writeroutput.addPage(pagex)
>>> writeroutput.write(outputfile)
>>> outputfile.close()
>>> transparentbannerfile.close()
>>> testpagefile.close()

Not beautiful (I would make the banner font edge thinner), but it is indeed transparent.


The transparency part is the alpha value in the code for the color transparentwhite.  There is some sample code that shows how to do this on's site.

The last thing I needed to deal with was bookmarks.  I had some problems initially in that, although the bookmark showed up, it ended up at the bottom of the page underneath the SSRS tables and charts I was trying to reference.  I got around this by digging into the dictionary structure of the PyPDF2 Bookmark object.  Here is the (one line) function code:



def fixbookmark(bookmark):
    bookmark is a PyPDF2 bookmark object.

    Side effect function that changes bookmark
    page display mode to full page.
    # getObject yields a dictionary
    props = bookmark.getObject()[OBJECTKEY][LISTKEY][1] = pdf.generic.NameObject(FULLPAGE)
    return 0

addBookmark is a method of the PyPDF2.PdfFileWriter object.  It takes a string name, a page index (zero based), and an optional parent PyPDF2 Bookmark object.  The references in my fixbookmark function "take" prior to writing the pdf to disk with the write method of the PyPDF2.PdfFileWriter object.

Mike Driscoll blogged about PyPDF2 a couple years back.  He's got a whole series on PDF's in fact (aside: the man is a pragmatic programming blogging machine).  There are good code snippets and pretty good comment threads on those posts for newbs like me.  I found that rl118.pdf doc useful for familiarizing myself with the pdf file format and constants used to reference objects within the file format.

This was a bit of an experience dump on my part.  If you've read this far, thank you for your patience and for having a look.

Tuesday, March 18, 2014

(Windows) LogParser - Install Without Admin Rights

A twitter acquaintaince @zippy1981 recommended the Window's software LogParser as a replacement for MSSQL bcp for my data transfer needs.  I downloaded the msi file from Microsoft and tried to install it.  As is true with a lot of software at work, I got a message saying the software can't be installed without admin rights.

I tweeted @zippy1981 (actually Justin Dearing in "real life") back saying I couldn't install.  He suggested using 7zip to decompress the msi file.  I downloaded 7zip portable and followed the instructions and ended up with files with names like these:


The names are not handy for the executable, nor will they work, but they are easy enough to decipher - there's an underscore between the extension and a period following the filename with a long string of characters.

Here is the mini, somewhat clunky script I wrote for "fixing" the filenames (I used Python 3.3):

Remove extensions from extracted
msi files.

DIRX = 'C://UserPrograms//LogParserWorking//'

import os
import shutil

filenames = []

x = os.walk(DIRX)
# generator
for y in x:
    # lists of files
    for filex in y[2]:

for filex in filenames:
    # rip off end
    # change _ to .
    # reverse
    filey = filex[-1::-1]
    # strip
    dotx = filey.find('.')
    filey = filey[dotx + 1:]
    # replace underscore
    underscore = filey.find('_')
    firstpart = filey[:underscore]
    firstpart += '.'
    secondpart = filey[underscore + 1:]
    filey = firstpart + secondpart
    filey = filey[-1::-1]
    shutil.move(DIRX + filex, DIRX + filey)

And voilá - I've got LogParser without having to bother our IT people for an install.
I'm probably late to the party on this msi extraction concept.  Still, I thought there might be other people who are as unaware of it as I was, so I'm blogging it.  Thanks for having a look.

Saturday, February 9, 2013

Generators Used as Counters

I've a bit late to the Python generator party.  My main problem in adoption was my lack of understanding of where I could apply the generator function concept.  I always wanted to apply this tool to counters in a counting based problem and I recently had an opportunity to do so.

Confession:  every time I do a counter based problem, I end up with off by one errors, so much so that I have to go through at least three iterations to get it right.

Generator functions used in part as counters are what got me through this.  The script was a mess and the use of the yield statement inside a couple functions cleaned it up for me to the point where it only took one more iteration to get it right (I had spent a weekend and about six iterations prior to this).

The main utility of the yield statement for me has three parts:

    1) it "remembers" where it is and where it stopped within the generator between next() calls.

    2) it "remembers" the values within itself (in this case a counter) between next() calls.

    3) because of 1) and 2) it's harder (for me, at least) to corrupt values and state than it is by passing around a counter variable and applying += operator calls.

The (somewhat convoluted) scenario:

I'm working at a mine in Africa.  The boss comes to me with a spreadsheet of about 500 items from various data sources and requests a database and a report that looks just like it and updates any time it's called.  Difficult, but not impossible.  I start on a schema for a database.  Considering I'm a geologist, it's not all bad:


Coworker comes by not long after and informs us that he is not keen on the idea of a custom database and that the problem should be done using SharePoint lists as lookup tables.  MS SQL 2008 r2 SSRS (SQL Server Reporting Services) is what we and our mine software vendor use for reporting.

I start to work on the SharePoint lists and hit a snag.  SSRS in its current state can only operate on a single dataset when you're presenting information in a table.  My main dataset in this case is the list of performance indicators or assays my boss asked for in the first place.  That list is in a SharePoint list.  Further, SSRS can only take one SharePoint list as a data source.  In other words, to the best of my knowledge, you can not write a relational query across SharePoint lists in SSRS 2008 r2.  My solution was to lookup values within my table in SSRS based on names or keys.  It got ugly and string concatenation was involved, but, amazingly, it worked.

The only real problem, apart from lousy performance, was the matching of shipment data from the mine based on mine levels as required in the spreadsheet.  What I needed to do was create a separate record in the main KPI list for each KPI-pit-material(roughly economic character of the rock)-bench-subbench combination.  There was a last column dealing with mining method, but for the time being I could get that from the subbench number.

Anyway, 4 or 5 data sources requiring elevations, 4 (currently, there will be more) open pits, 5 KPI's, 9 or 10 materials, about 40 benches, 9 subbenches - it ends up being a lot of records (5 * 4 * 5 * 10 * 40 * 9 ~= 360,000).  In the end I had about 350,000 records total.  I actually started doing this by hand in an Excel spreadsheet.  I got as far as replicating everything for the four pits and knew I would have to do the rest programmatically.  Knowing what I know now, I would have done the whole thing programmatically.  Live and learn.

Shown below is the code (sorry about the plain look - I'm not with it with Syntax Highlighter and other tools yet).  There are some naive constructs and hacks, but it works.  Thanks for having a look.

Deal with main list of KPI's in
Jeff's reconciliation report.
Deal with elevation, flitch, and mining method here.
Assign key values (integers) to
relevant columns.

import csv
import collections as colx
import os
import sys
import pprint

# ranges that require bench/flitch addition
import benchranges as br

# bench range file format:
# """
# Row ranges in <boss's> Excel spreadsheet
# that require repetition for bench and flitch.
# """
# RANGES = ((36, 42),
#           (42, 48),
#           (48, 54),
#           (54, 60),
#           (60, 66),
#           (66, 72),
#           (72, 78),
#           (78, 84),
#              .
#              .
#              .
#             etc.

DIRX = r'C:\MSPROJECTS\SiteTripJAN13\JeffDatabase\FirstAttemptWDatabase\LookupDataSetup'
DIRCSV = r'C:\MSPROJECTS\SiteTripJAN13\JeffDatabase\FirstAttemptWDatabase\LookupDataSetup\loadtosharepoint\spreadsheetsforload\cvsdumps'
CSV = '.csv'

KPIFIELDS = ['KPIID', 'KPIOrder', 'AttributeDescription', 'Units', 'DataType', 'Source', 'ValueCalc', 'Period', 'Location', 'Pit', 'Level', 'MiningMethod']


# account for extra level, flitch, and BHVermeer files
CSVFILES = KPIFIELDS[2:] + ['Benches', 'BHVermeer', 'Flitches']

MAINFILE = 'kpiids'

BIGNUMBER = 1000000

# OK, the only file that needs a named tuple is the KPI one.
# The others all have two columns where we're trying to match
# basic format of KPI lookup csv files (faux unit example):
# 1,Frobnostication unit
# 2,Foo unit
# 3,ounces
# 4,gallons per minute
#     .
#     .
#     .
#    etc.

# format of main input file:
# 1,1,Plant Harvest Salable Copper,Tonne Metal,Plant Actual,Mill Report,Value,Shift Day,Plant,Not Applicable,Not Applicable,Not Applicable
# 2,2,Plant Contained Salable Cobalt,Tonne Metal,Plant Actual,Mill Report,Value,Shift Day,Plant,Not Applicable,Not Applicable,Not Applicable
# 3,3,Mill Tonnage,Tonne Ore,Plant Actual,Mill Report,Value,Hour,Plant,Not Applicable,Not Applicable,Not Applicable
#     .
#     .
#     .
#    etc.

# intended format of output file:
# 1,1,120,10,4,2,2,3,3,3,2,42,10,2,4
# 2,2,119,10,4,2,2,3,3,3,2,42,10,2,4
# 3,3,108,11,4,2,2,1,3,3,2,42,10,2,4
#     .
#     .
#     .
#    etc.

# have to add slots for pit, bench, flitch, mining method
# 14 slots all integers
# XXX - very much hard coded
# 'KPIID', 'KPIOrder', 'AttributeDescription', 'Units'
# 'DataType', 'Source', 'ValueCalc', 'Period'
# 'Location', 'Pit', 'Level', Bench, Flitch'
# 'MiningMethod', BHVermeer

MAINFILELINEFMT = (11 * '{{{}}},').format(*[x for x in KPIFIELDS[:11]])
MAINFILELINEFMT += '{Bench},{Flitch},{MiningMethod},{BHVermeer}'

PIT = 'Pit'
BENCH = 'Bench'
FLITCH = 'Flitch'
NA = 'Not Applicable'

# specific for each item (1FEB2013)
EXCEL = 'excel'

def makelookupdict():
    Read the available csv files for the SharePoint
    lookup tables and make a corresponding dictionary
    of integer values corresponding to each value
    (string) key.

    Returns dictionary.
    # make lookup table of values
    lookupdict = {}
    for filex in CSVFILES:
        f = open(filex + CSV, 'r')
        lookupdict[filex] = {}
        readerx = csv.reader(f, dialect = EXCEL)
        for linex in readerx:
            keyx = linex[1]
            lookupdict[filex][keyx] = linex[0]
    return lookupdict

def newnum():
    Generator function that will continuously
    increment an integer and yield the new
    x = 1
    while 1:
        yield x
        x += 1

def initializerecord(recorddict, recordnt, lookupdict):
    Put corresponding integer lookup values
    for fields in dictionary.

    recorddict is an empty dictionary.

    recordnt is a named tuple of record values.

    lookupdict is a dictionary of lookup indices
    for the values in question.
    for namex in KPIFIELDS[2:]:
        # XXX - dictionary lookup is a bit hackish
        recorddict[namex] = lookupdict[namex][recordnt.__dict__[namex]]

def dummyfieldsforkpi(recorddict, n):
    Fills in Not Applicable for fields in KPI
    that doesn't require them (bench, flitch, etc.)

    recorddict is the dictionary for the KPI
    in question.

    n is the KPI identifier and order number
    of original KPI file.
    recorddict[BENCH] = NABENCH
    recorddict[FLITCH] = NAFLITCH
    recorddict[KPIID] = n
    recorddict[KPIORDER] = n

def dealwithbenchloop(n, counter, recorddicts, f2):
    Generator function.

    Fill in rows for benches and flitches for a given
    set of grouped KPI's.

    n is the current line number of the file being
    written to.

    counter is the generator for incrementing line
    numbers is the file being written to.

    recorddicts is the base list of grouped record

    f2 is the file being written to.

    Returns the position the line number counter
    for the new file.
    # loop through benches
    for benchx in range(1, 42):
        # loop through flitches, increment counter
        # 1-9 corresponds to flitch indices in original flitch list (1-9)
        for flitchx in range(1, 10):
            for recn in recorddicts:
                # newlineno will represent KPIID, KPIOrder, add to dict
                recn[KPIID] = n
                recn[KPIORDER] = n
                # match bench, flitch, bhvermeer with id's and add to dict
                recn[BENCH] = benchx
                recn[FLITCH] = flitchx
                if flitchx == 9:
                    recn[BHVERMEER] = 1
                    recn[BHVERMEER] = 2
                # write dictionary to file
                print(MAINFILELINEFMT.format(**recn), file = f2)
                n = next(counter)
                yield n

def cyclethroughkpis():
    Generator function that will cycle through
    a list of KPI (key performance indicator)
    records and write each one to a new file
    adding records for benches and flitches
    where appropriate.
    # get main file and assign lookup values
    f = open(MAINFILE + CSV, 'r')
    f2 = open('kpilookups.csv', 'w')
    # named tuple for cycling through file
    recordx = colx.namedtuple('recordx', KPIFIELDS)
    # counter for new file
    x = newnum()
    # map object of records
    # want line number and named tuple records in an iterable
    ziplines = zip(range(1, BIGNUMBER), map(recordx._make, csv.reader(f, dialect = EXCEL)))
    # lookup dictionary
    lookupdict = makelookupdict()
    # multiple bench original file line index ranges
    benchidxs = (range(*z) for z in br.RANGES)
    # increment new file counter
    n = next(x)
    while 1:
        recorddict = {}
            zipnt = next(ziplines)
        except StopIteration:
        initializerecord(recorddict, zipnt[1], lookupdict)
        # if this requires multiple repeated records for bench and flitch
        if int(recorddict[PIT]) != NAPIT:
            # list of grouped record dictionaries
            recorddicts = [recorddict]
            # get range of rows for grouped KPI's
            rowrange = next(benchidxs)
            # first item in rowrange already current
            for i in rowrange[1:]:
                zipnt = next(ziplines)
                recorddict = {}
                initializerecord(recorddict, zipnt[1], lookupdict)
            r = dealwithbenchloop(n, x, recorddicts, f2)
            # track new file counter
            for w in r:
                yield w
            # track counter
            n = w
            yield n, zipnt[0], rowrange
            # make record complete
            dummyfieldsforkpi(recorddict, n)
            # write to file
            print(MAINFILELINEFMT.format(**recorddict), file = f2)
            # yield current file counter, original file counter, and row range
            yield n, zipnt[0], None
            n = next(x)

if __name__ == '__main__':
    dorecords = cyclethroughkpis()
    while 1:
        except StopIteration:
            print('Finished run.')
            # XXX - a bit hackish

Monday, December 24, 2012

Larry Berry (Mintec) 1962 - 2012

This is another somewhat philosophical post.  I promise I will post code and tangible how-to stuff again.

A mentor of mine, the individual who turned me on to Python, has passed on.  The whole deal was a bit sudden and really gave me pause.  What I'll try to do here is briefly summarize Larry's body of work, and put it in perspective with regard to software development and domain specific software.  Although his efforts won't see headlines in either mining circles or software ones, what he did was pretty amazing.

Larry worked for Mintec, a mine software vendor, from 1990 on. His specialty and his passion were Ore Control, the place where mine operations and engineering come together, the place, figuratively speaking, where the rubber meets the road. This part of the mining process, in my opinion, was often the bastard child of most upper management. My impression was that at the time Larry began his career at Mintec not all, but many management types had the idea that if you just planned the mine right and threw a bunch of grunts into the pit with some equipment, everything would magically happen. Tactical execution and measurement of that process are important. Larry was one of the people who recognized this immediately and did something about it. He championed until his death the idea that Ore Control and short range planning were information intensive parts of the mining process that demanded an operational database and a programming API that allowed for dynamic process configuration.

Around the year 2000, Mintec released a planning database and a Python API. Larry was very much a software implementer at the time and not a developer. Nonetheless, his input was an important part of the feedback loop within the relatively small mine software company. He told me that they had been looking at Lua as an API. It did not particularly work for him and they settled on Python instead. By comparison, I've heard that one of the other software mining companies uses Perl for its API, and another one VBA. I'm biased, but I think that for ease of learning for the customer and internal productivity, Mintec made the right choice. They started out with Python 2.1, then, along with some changes to the API, switched to Python 2.2. Today we are using Python 2.5.

As a programmer, Larry was mixed. He was a miner first, and anything that got in the way of his goal was a mere distraction. His code was clean, but seldom bulletproof. I would call him an offensive programmer, the opposite of a defensive one. He was an upbeat optimist. I was and am a somewhat morose pessimist. Much to the entertainment of those working around us, we would argue about code. The Larryism that caused me the most heartburn was the variable declared inside a conditional block (Python is a dynamic language that allows you to do that) that under a corner case didn't get executed.

Larry: Well if your data were squared away it would work!

Me: LARRY! Our data are NEVER squared away!

The crown achievement of Larry's Mintec career, in my estimation, was not so much his code, but his driving through of the whole MineSight Operations package, basically the Ore Control software I described above, internal to Mintec and within the mining industry. I saw things from the client side where Larry was the vendor and I represented the mining company. I can attest to the blood, sweat, and tears involved. Spending weeks away from your family doing an implementation in a decidedly unergonomic, dusty mine office and debugging software over the phone on Christmas Day are things you can't pay someone enough to do. Larry was a true believer in his product and did what was necessary to make it happen.

Ironically, Larry was promoted to Project Manager for MineSight Operations about eight months before his death. He had arrived after all that hard work. I would tease him about spending his time in a corner office with the head of the company smoking cigars and sipping scotch. He laughed. The guy never stopped working. He was very much a type A personality. The corporate life would have been easier on him and his family. I don't know how well he would have liked it long term. He loved to be in the thick of things, fixing things, making things work beautifully.

The main lesson I can glean from Larry's life is that if you want to do something significant, liking what you do is not enough. You have to care intensely and passionately about what you're doing. He was one hell of an individual.

Thursday, December 13, 2012

Plan for Pycon 2013

I looked at the Pycon talks and events schedule and was a bit overwhelmed.  There's just too much awesome, which is, for me, a bad thing.  Basically I've got to narrow down a strategy and a scope for attending this thing or I'll just learn nothing about everything and come back to work in Arizona and Africa totally burned out from lack of sleep.

What I've tried to do is group things in categories that are relevant to what I do for a living or what interests me to the point I would make the time to actually work on them.  I then listed the talks in these categories (mine, not the conference organizers') on paper and ranked them within each category.
The pattern that fell out for me is that I would be best served by focusing on Python 3.3 and how to use it best.  For me, that's going to be the theme of the conference, even if there are other more prominent official themes.

Disclaimer:  I'm a geologist/wannabe mining engineer/sometimes programmer/sometimes developer who is trying to write decent code and maintainable software.  Among the Pycon crowd, I'm probably well into the lower 50th or 25th percentile in terms of computer literacy and programming skills.  Among mining professionals I'm probably somewhere in the upper 25th percentile in terms of those skills.  It's all relative.  What's good for me is probably not the same as what's good for you.  Hopefully this gives other people some idea of how they would like to plan their conference experience.

A) Core Language for the User

1) How to Except When You're Excepting - Nam
2) Become a logging expert in 30 minutes - G. Roy
3) Iteration & Generators: the Python Way - Ramalho
4) Loop like a native: while, for, iterators, generators - Batchelder
5) Python's Class Development Toolkit - Hettinger
6) Transforming Code into Beautiful, Idiomatic Python - Hettinger
7) Encapsulation with descriptors - Ramalho

This is all basic language stuff.  I need the repetition and reinforcement.  "Yeah, but Carl, TWO talks on iteration?!  Don't you think you're taking this whole repetition/looping thing too far?"  No.  When you do summing of tonnes and grade and weighted averages and binning of material types and truck counts and processing of millions of individual 3D geologic model blocks for a living, there are never enough ways or efficient enough programming language idioms.  iter(), like, um, forever . . .

B) Python Internals

1) The Guts of Unicode in Python - Peterson
2) All-Singing All-Dancing Python Bytecode - Hastings
3) How Import Works - Canon
4) sys._current_frames(): Take real-time x-rays of your software for fun and performance - Rochael

I'm probably going to actually understand about 25% of what I hear IF I actually do some research ahead of time.  Still, knowing how something actually works can be a huge advantage.

C) Miscellaneous Stuff

1) Why you should use Python 3 for text processing - Mertz
2) How the Internet works - McKellar
3) Internationalization and Localization Done Right - Varshney
4) Python for Robotics and Hardware Control - Foote
5) Realtime Tracking and Mapping of Geographic Objects using Python - Burhum
6) Location, Location, Location - Grace
7) The Magic of Metaprogramming - Rush
8) Integrating Jython with Java - Baker
9) Getting Python and Django Through Your Java Shop Front Door (with Jython) - Wierzbicki

All stuff that I've had to deal with in the workplace.  All stuff that I wished I had been smarter about :-\

D) Community

1) Scaling community diversity outreach - Laroia
2) Designers + Developers: Collaborating on your Python project - Elman
3) How (Not) To Build An OSS Community - Lindsley

People hacking is not my strong suit.  Anna Ravenscroft gave a talk at Pycon back in 2010 on diversity.  I had a couple "Huh, I never thought of that.  Yeah," moments during that talk.  If I can glean a handful more of those "Huh" moments from these talks, they are more than worth it.

E) Tutorials

1) A Gentle Introduction to Computer Vision - Scott
2) Python 3 Metaprogramming - Beazley

We use a fair bit of computer vision in mining and that will probably increase as we move into the future.  The main application I can think of is estimation of rock fragmentation size with SPLIT imagery.  I actually own a VB book that covers a lot of basic concepts in a programming context (Visual Basic Graphics Programming - Stephens).  The problem is that it's 1) in a dead language - VB6 and 2) not in Python.

Beazley's course is going to be sick.  I will be truly disappointed if my head doesn't explode along with everyone else's and there's not a mini zombie apocalypse inside the classroom ;-)

F) PyPgDay 2013 - San Francisco PostgreSQL User Group

I've been to these PostgreSQL days the day before OSCON a time or two.  They are generally fantastic.  I really like the PostgreSQL community and the rigor they put into making robust database software.  Berkus or Deckelman, or both should be there.  Good, smart folks!  I am looking forward to this.

F) Keynotes

1) Hettinger
2) McKellar

Confession:  I am not into keynotes.  Basically it's first thing in the morning, you're beat from travel.  You herd into a huge dark conference room with a zillion other cattle, er, people.  If the talk, like a movie, doesn't make you forget about how you feel it can be a bit of a soul crushing letdown.  Back in 2010 Guido changed the format of his keynote to a twitter Q & A.  I hated it; some people loved it.  Yes, I am an uptight inflexible old fart who could stand to loosen up.  Yes, Guido is an indisputably great guy.  That said, I don't think I'll re-Guido myself on the keynote end for some time to come.

Rasberry Pi is a wildly cool innovation.  Sadly, I still really haven't successfully hacked on the Arduino.  When people are mastering the great grandchild to the Rasberry Pi, I'll finally be getting started on the Arduino when I'm in a nursing home.  Same problem as I mentioned at the outset - too much awesomeness, too much of a good thing.

Hettinger is really in the zone on the Python language end of things and cool new ways to use it.  He tweets tips constantly and has an infectious enthusiasm for the language.  I don't think he's reached his ceiling with respect to innovation or burn out.  Safe bet.

McKellar is a Linux kernel engineer, if I understood her bio correctly.  So here's this woman who has really made it in the industry and isn't wanting for anything accomplishment-wise or employment related.  Nonetheless, she is taking the time to get involved in outreach.  To me that says passion.  I'd like to hear what she has to say, whether she's talking about kernel hacking or outreach.

G) Conference Prep

1) Install or build Python 3.3 and dink with it.
2) Get a real database with real world data that mean something to me set up on Postgresql on my OpenBSD laptop.
3) Brush up on Unicode and Python's Unicode capabilities in preparation for Peterson's talk.

That's about it.  See you at the con.