Friday, October 31, 2014

Gtk.TreeView (grid view) with mono, gtk-sharp, and IronPython

The post immediately prior to this one was an attempt to reproduce Windows.Forms Calendar controls in Gtk for effective cross-platform (Windows/*nix) rendering.

This time I am attempting to get familiar with gtk-sharp/Gtk's version of a grid view - the Gtk.TreeView object.  Some of the gtk-sharp documentation suggests the NodeView object would be easier to use.  I had some trouble instantiating the objects associated with the NodeView and went with the TreeView instead in the hopes of getting more control.

The Windows.Forms GridView I did years ago is here.  It became apparent to me shortly after embarking on this journey that I would be hard pressed to recreate all the functionality of that script in a timely manner.  I settled for a tabular view of drillhole data (fabricated, mock data) with some custom formatting.

Aside:  this is typically how mineral exploration drillhole data (core, reverse circulation drilling) is presented in tabular format - a series of from-to intervals with assay values.  Assuming the assays are all separate elements, the reported weight percents should not sum to more than 100%, and never do unless someone fat fingers a decimal place.  I've projected a couple screaming hot polymetallic drill holes that end near surface (lack of funding for drilling), but show enough promise that the new mining town of Trachteville (the drill hole name CBT-BNZA stands for CBT-Bonanza) will spring up there at any moment . . . one can dream.
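
The 100% rule in the aside is easy to check mechanically.  Here is a minimal sketch in plain Python (no Gtk needed); the function name overlimitintervals and the row layout are my own assumptions for illustration, and the second row is the mock 15.3 to 25.3 interval with a deliberately slipped decimal place on the first assay:

```python
# Hypothetical helper, not part of the TreeView script:
# flag intervals whose assay weight percents sum past 100.
def overlimitintervals(rows, limit=100.0):
    """
    rows is an iterable of (drillhole, from, to, assays)
    tuples; assays is a sequence of weight percents.

    Returns list of (drillhole, from, to, total) tuples
    for the intervals that exceed limit.
    """
    bad = []
    for drillhole, fromx, tox, assays in rows:
        total = sum(assays)
        if total > limit:
            bad.append((drillhole, fromx, tox, total))
    return bad


ROWS = [('DH_CBTBNZA-01', 0.0, 8.7, (22.27, 4.93, 18.75, 35.18)),
        # Fat fingered on purpose: 2.56 entered as 256.0.
        ('DH_CBTBNZA-01', 15.3, 25.3, (256.0, 11.34, 0.19, 13.46))]

for badrow in overlimitintervals(ROWS):
    print('{:s} {:.1f}-{:.1f} sums to {:.2f}%'.format(*badrow))
```

Only the second interval should be reported; the first sums to a little over 81%.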

The data store object for the grid view, Gtk.ListStore, would not instantiate in IronPython.  I was not the only person to have experienced this problem (I cannot locate the link to the mailing list thread or forum reference, but like the big fish that got away, I swear I saw it).  I didn't want to drop the effort just because of that, so I hacked and compiled some C# code:

public class storex
{
    public Gtk.ListStore drillhole =
                            // 7 columns
                            // drillhole id
          new Gtk.ListStore (typeof (string),
                            // from
                            typeof (double),
                            // to
                            typeof (double),
                            // assay1
                            typeof (double),
                            // assay2
                            typeof (double),
                            // assay3
                            typeof (double),
                            // assay4
                            typeof (double));
}


The compile command (mono's mcs) on Windows was:

C:\UserPrograms\Mono-3.2.3>mcs -pkg:gtk-sharp-2.0 /target:library C:\UserPrograms\IronPythonGUI\storex.cs

Those are my file paths; locations depend on where you install things like mono and IronPython.

Anyway, I got my dll and I was off to the races.  Getting to know the Gtk and gtk-sharp object model proved challenging for me.  I'm glad I got some familiarity with it, but it would take me longer to do something in Gtk than it did with Windows.Forms.  The most fun and gratifying part of the project was getting the custom formatting to work with a Gtk.TreeCellDataFunc.  I used a function that yielded specific functions for each column - something that's really easy to do in Python.
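
The "function that yields functions" idea can be shown without Gtk at all.  This is only a sketch under my own assumptions: makeformatter and its (text, color) return value are made up for illustration and are not the gtk-sharp API; in the real script the inner function sets properties on the cell renderer and is wrapped in a Gtk.TreeCellDataFunc.

```python
# Plain Python sketch of a formatter factory.
# makeformatter and the (text, color) tuple are
# illustrative assumptions, not gtk-sharp calls.
def makeformatter(floatfmt, cutoff=None):
    """
    Returns a function that formats a float with
    floatfmt and picks a foreground color based
    on cutoff (None means always black).
    """
    def formatvalue(value):
        # floatfmt and cutoff are remembered by the closure.
        text = floatfmt.format(value)
        color = 'black'
        if cutoff is not None and value > cutoff:
            color = 'red'
        return text, color
    return formatvalue


# One formatter per kind of column.
depthformat = makeformatter('{:>5.1f}')
assayformat = makeformatter('{:>4.2f}', cutoff=10.0)

print(depthformat(8.7))
print(assayformat(22.27))
print(assayformat(4.93))
```

Each call to makeformatter hands back a new function with its own format string and cutoff baked in, which is why one factory can serve every column.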

Anyway, here are a couple screenshots and the IronPython code:





The OpenBSD one below turned out pretty good, but the Windows one had a little double line underneath the first row - it looked as though it was still trying to select that row when I told it specifically not to.  I'm not a design perfectionist Steve Jobs type, but niggling nits like that drive me batty.  For now, though, it's best I publish the code and move on.

#!/usr/local/bin/mono /home/carl/IronPython-2.7.4/ipy64.exe

import clr

GTKSHARP = 'gtk-sharp'
PANGO = 'pango-sharp'

# Mock store C#
STOREX = 'storex'

clr.AddReference(GTKSHARP)
clr.AddReference(PANGO)

# C# module compiled for this project.
# Problems with Gtk.ListStore in IronPython.
clr.AddReference(STOREX)

import Gtk
import Pango

import storex

TITLE = 'Gtk.TreeView Demo (Drillholes)'
MARKUP = '<span font="Courier New" size="14" weight="bold">{:s}</span>'
MARKEDUPTITLE = MARKUP.format(TITLE)

CENTERED = 0.5
RIGHT = 1.0

WINDOWWIDTH = 350

COURFONTREGULAR = 'Courier New 12'
COURFONTBOLD = 'Courier New Bold 12'

DHNAME = 'DH_CBTBNZA-{:>02d}'
DHNAMELABEL = 'drillhole'
FROM = 'from'
TO = 'to'
ASSAY1 = 'assay1'
ASSAY2 = 'assay2'
ASSAY3 = 'assay3'
ASSAY4 = 'assay4'

FP1FMT = '{:>5.1f}'
FP2FMT = '{:>4.2f}'

DHDATAX = {(DHNAME.format(1), 0.0):{TO:8.7,
                                    ASSAY1:22.27,
                                    ASSAY2:4.93,
                                    ASSAY3:18.75,
                                    ASSAY4:35.18},
           (DHNAME.format(1), 8.7):{TO:15.3,
                                    ASSAY1:0.27,
                                    ASSAY2:0.09,
                                    ASSAY3:0.03,
                                    ASSAY4:0.22},
           (DHNAME.format(1), 15.3):{TO:25.3,
                                     ASSAY1:2.56,
                                     ASSAY2:11.34,
                                     ASSAY3:0.19,
                                     ASSAY4:13.46},
           (DHNAME.format(2), 0.0):{TO:10.0,
                                    ASSAY1:0.07,
                                    ASSAY2:1.23,
                                    ASSAY3:4.78,
                                    ASSAY4:5.13},
           (DHNAME.format(2), 10.0):{TO:20.0,
                                     ASSAY1:44.88,
                                     ASSAY2:12.97,
                                     ASSAY3:0.19,
                                     ASSAY4:0.03}}

FIELDS = [DHNAMELABEL, FROM, TO, ASSAY1, ASSAY2, ASSAY3, ASSAY4]
BOLDEDCOLUMNS = [DHNAMELABEL, FROM, TO]
NONKEYFIELDS = FIELDS[2:]

BLAZINGCUTOFF = 10.0

def genericfloatformat(floatfmt, index):
    """
    For cell formatting in Gtk.TreeView.

    Returns a function to format floats
    and to format floats' foreground color
    based on cutoff value.

    floatfmt is a format string.

    index is an int that indicates the
    column being formatted.
    """
    def setfloatfmt(treeviewcolumn, cellrenderer, treemodel, treeiter):
        cellrenderer.Text = floatfmt.format(treemodel.GetValue(treeiter, index))
        # If it is one of the assay value columns.
        # XXX - not generic.
        if index > 2:
            if treemodel.GetValue(treeiter, index) > BLAZINGCUTOFF:
                cellrenderer.Foreground = 'red'
            else:
                cellrenderer.Foreground = 'black'
    return Gtk.TreeCellDataFunc(setfloatfmt)

class TreeViewTest(object):
    def __init__(self):
        Gtk.Application.Init()
        self.window = Gtk.Window('')
        # DeleteEvent - copied from Gtk demo on internet.
        self.window.DeleteEvent += self.DeleteEvent
        # Frame property provides a frame and title.
        self.frame = Gtk.Frame(MARKEDUPTITLE)
        self.tree = Gtk.TreeView()
        self.tree.EnableGridLines = Gtk.TreeViewGridLines.Both
        self.frame.Add(self.tree)

        # Fonts for formatting.
        self.fdregular = Pango.FontDescription.FromString(COURFONTREGULAR)
        self.fdbold = Pango.FontDescription.FromString(COURFONTBOLD)

        # C# module
        self.store = storex().drillhole

        self.makecolumns()
        self.adddata()
        self.tree.Model = self.store

        self.formatcolumns()
        self.formatcells()
        self.prettyup()

        self.window.Add(self.frame)
        self.window.ShowAll()
        # Keep text viewable - size no smaller than intended.
        self.window.AllowShrink = False
        # XXX - hack to keep lack of gridlines on edges of
        #       table from showing.
        self.window.AllowGrow = False
        # Unselect everything for this demo.
        self.tree.Selection.UnselectAll()
        Gtk.Application.Run()

    def makecolumns(self):
        """
        Fill in columns for TreeView.
        """
        self.columns = {}
        for fieldx in FIELDS:
            self.columns[fieldx] = Gtk.TreeViewColumn()
            self.columns[fieldx].Title = fieldx
            self.tree.AppendColumn(self.columns[fieldx])

    def formatcolumns(self):
        """
        Make custom labels for column headers.

        Get each column properly justified (all
        are right justified, floating point numbers
        except for the drillhole 'number' -
        actually a string).
        """
        self.customlabels = {}

        for fieldx in FIELDS:
            # This centers the labels at the top.
            self.columns[fieldx].Alignment = CENTERED
            self.customlabels[fieldx] = Gtk.Label(self.columns[fieldx].Title)
            self.customlabels[fieldx].ModifyFont(self.fdbold)
            # 120 is about right for from, to, and assay columns.
            self.columns[fieldx].MinWidth = 120
            self.customlabels[fieldx].ShowAll()
            self.columns[fieldx].Widget = self.customlabels[fieldx]
            # ShowAll required for new label to take.
            self.columns[fieldx].Widget.ShowAll()

    def formatcells(self):
        """
        Add and format cell renderers.
        """
        self.cellrenderers = {}

        for fieldx in FIELDS:
            self.cellrenderers[fieldx] = Gtk.CellRendererText()
            self.columns[fieldx].PackStart(self.cellrenderers[fieldx], True)
            # Drillhole 'number' (string)
            if fieldx == FIELDS[0]:
                self.cellrenderers[fieldx].Xalign = CENTERED
                self.columns[fieldx].AddAttribute(self.cellrenderers[fieldx],
                        'text', 0)
            else:
                self.cellrenderers[fieldx].Xalign = RIGHT
                try:
                    self.columns[fieldx].AddAttribute(self.cellrenderers[fieldx],
                            'text', FIELDS.index(fieldx))
                except ValueError:
                    print('\n\nProblem with field definitions; field not found.\n\n')
        for fieldx in BOLDEDCOLUMNS:
            self.cellrenderers[fieldx].Font = COURFONTBOLD
        self.columns[fieldx].Widget.ShowAll()

        # XXX - not very generic, but better than doing them one by one.
        # from, to columns.
        for x in xrange(1, 3):
            self.columns[FIELDS[x]].SetCellDataFunc(self.cellrenderers[FIELDS[x]],
                    genericfloatformat(FP1FMT, x))
        # assay<x> columns.
        for x in xrange(3, 7):
            self.columns[FIELDS[x]].SetCellDataFunc(self.cellrenderers[FIELDS[x]],
                    genericfloatformat(FP2FMT, x))

    def usemarkup(self):
        """
        Refreshes UseMarkup property on widgets (labels)
        so that they display properly and without
        markup text.
        """
        # Have to refresh this property each time.
        self.frame.LabelWidget.UseMarkup = True

    def prettyup(self):
        """
        Get Gtk objects looking the way we
        intended.
        """
        # Try to get Courier New on treeview.
        self.tree.ModifyFont(self.fdregular)
        # Get rid of line.
        self.frame.Shadow = Gtk.ShadowType.None
        self.usemarkup()

    def adddata(self):
        """
        Put data into store.
        """
        # XXX - difficulty figuring out sorting
        #       function for TreeView.  Hack it
        #       with dictionary here.
        keytuples = [key for key in DHDATAX]
        keytuples.sort()
        datax = []
        for tuplex in keytuples:
            # Key tuple holds the drillhole id and from depth.
            datax.extend(tuplex)
            for fieldx in NONKEYFIELDS:
                datax.append(DHDATAX[tuplex][fieldx])
            self.store.AppendValues(*datax)
            # Reinitialize data row list.
            datax = []

    def DeleteEvent(self, widget, event):
        Gtk.Application.Quit()

if __name__ == '__main__':
    TreeViewTest()


Thanks for stopping by.

Thursday, October 30, 2014

Mono gtk-sharp IronPython CalendarView

A number of years ago I did a post on the IronPython Cookbook site about the Windows.Forms Calendar control.  I could never get the thing to render nicely on *nix operating systems (BSD family).  It sounds as though Windows.Forms development for mono (and in general) is kind of dead, so there is not much hope that solution/example will ever render nicely on *nix.  Recently I've been playing with mono and decided to give gtk-sharp a shot with IronPython.

Quick disclaimers:

1) I suspect from the examples I've seen on the internet that PyGtk is a little easier to deal with than gtk-sharp.  That's OK; I wanted to use IronPython and have the rest of the mono/dotNet framework available, so I went through the extra trouble to forgo CPython and PyGtk and go with IronPython and gtk-sharp instead.

2) The desktop is not the most cutting edge or sexy platform in 2014.  Nonetheless, where I work it is alive and well.  When I no longer see engineers hacking solutions in Excel and VBA, I'll consider the possibility of outliving the desktop.  Right now I'm not hopeful :-\

The results aren't bad, at least as far as rendering goes.  I couldn't get the Courier font to take on OpenBSD, but the Gtk Calendar control looks acceptable.  All in all, I was OK with the results on both Windows and OpenBSD.  I've heard Gtk doesn't do quite as well on Apple products, but I don't own a Mac to test with.  Here are a couple screenshots:






I run the cwm window manager on OpenBSD and have it set up to cut out borders on windows, hence the more minimalist look to the control there.

IronPython output on *nix has always come out in yellow or white - it doesn't show up on a white background, which I prefer.  In order to get around this, I run an xterm with a black background:

xterm -bg black -fg white

Here is the code for the gtk-sharp Gtk.Calendar control:

#!/usr/local/bin/mono /home/carl/IronPython-2.7.4/ipy64.exe

import clr

GTKSHARP = 'gtk-sharp'
PANGO = 'pango-sharp'

clr.AddReference(GTKSHARP)
clr.AddReference(PANGO)

import Gtk
import Pango

import datetime

TITLE = 'Gtk.Calendar Demo'
MARKUP = '<span font="Courier New" size="14" weight="bold">{:s}</span>'
MARKEDUPTITLE = MARKUP.format(TITLE)

INFOMSG = '<span font="Courier New 12">\n\n Program set to run for:\n\n '
INFOMSG += '{:%Y-%m-%d}\n\n</span>'

DATEDIFFMSG = '<span font="Courier New 12">\n\n '
DATEDIFFMSG += 'There are {0:d} days between the\n'
DATEDIFFMSG += ' beginning of the epoch and\n'
DATEDIFFMSG += ' {1:%Y-%m-%d}.\n\n</span>'

ALIGNMENTPARAMS = (0.0, 0.5, 0.0, 0.0)

WINDOWWIDTH = 350

CALENDARFONT = 'Courier New Bold 12'


class CalendarTest(object):
    # Beginning of epoch as a naive UTC datetime
    # (fromtimestamp would give local time instead).
    inthebeginning = datetime.datetime.utcfromtimestamp(0)
    # Debug info - make sure beginning of epoch really
    #              is midnight, Jan 1, 1970 GMT.
    print(inthebeginning)
    def __init__(self):
        Gtk.Application.Init()
        self.window = Gtk.Window(TITLE)
        # DeleteEvent - copied from Gtk demo on internet.
        self.window.DeleteEvent += self.DeleteEvent
        # Frame property provides a frame and title.
        self.frame = Gtk.Frame(MARKEDUPTITLE)
        self.calendar = Gtk.Calendar()
        # Handles date selection event.
        self.calendar.DaySelected += self.dateselect
        # Sets up text for labels.
        self.getcaltext()
        # Puts little box around text.
        self.datelabelframe = Gtk.Frame()
        # Try to get datelabel to align with other label.
        self.datelabelalignment = Gtk.Alignment(*ALIGNMENTPARAMS)
        self.datelabel = Gtk.Label(self.caltext)
        self.datelabelalignment.Add(self.datelabel)
        self.datelabelframe.Add(self.datelabelalignment)
        # Puts little box around text.
        self.datedifflabelframe = Gtk.Frame()
        self.datedifflabelalignment = Gtk.Alignment(*ALIGNMENTPARAMS)
        self.datedifflabel = Gtk.Label(self.timedifftext)
        self.datedifflabelalignment.Add(self.datedifflabel)
        self.datedifflabelframe.Add(self.datedifflabelalignment)
        self.vbox = Gtk.VBox()
        self.vbox.PackStart(self.datelabelframe)
        self.vbox.PackStart(self.datedifflabelframe)
        self.vbox.PackStart(self.calendar)
        self.frame.Add(self.vbox)
        self.window.Add(self.frame)
        self.prettyup()
        self.window.ShowAll()
        # Keep text viewable - size no smaller than intended.
        self.window.AllowShrink = False
        Gtk.Application.Run()

    def getcaltext(self):
        """
        Get messages for run date.
        """
        # Calendar month is 0 based.
        yearmonthday = self.calendar.Year, self.calendar.Month + 1, self.calendar.Day
        chosendate = datetime.datetime(*yearmonthday)
        self.caltext = INFOMSG.format(chosendate)
        # For reporting of number of days since beginning of epoch.
        timediff = chosendate - CalendarTest.inthebeginning
        self.timedifftext = DATEDIFFMSG.format(timediff.days, chosendate)

    def usemarkup(self):
        """
        Refreshes UseMarkup property on widgets (labels)
        so that they display properly and without
        markup text.
        """
        # Have to refresh this property each time.
        self.frame.LabelWidget.UseMarkup = True
        self.datelabel.UseMarkup = True
        self.datedifflabel.UseMarkup = True

    def prettyup(self):
        """
        Get Gtk objects looking the way we
        intended.
        """
        # Try to make frame wider.
        # XXX
        # Works nicely on Windows - try on Unix.
        # Allows bold, etc.
        self.usemarkup()
        self.frame.SetSizeRequest(WINDOWWIDTH, -1)
        # Get rid of line in middle of text on title.
        self.frame.Shadow = Gtk.ShadowType.None
        # Try to get Courier New on calendar.
        fd = Pango.FontDescription.FromString(CALENDARFONT)
        self.calendar.ModifyFont(fd)
        self.datelabel.Justify = Gtk.Justification.Left
        self.datedifflabel.Justify = Gtk.Justification.Left
        self.window.Title = ''
        self.usemarkup()

    def dateselect(self, widget, event):
        self.getcaltext()
        self.datelabel.Text = self.caltext
        self.datedifflabel.Text = self.timedifftext
        self.prettyup()

    def DeleteEvent(self, widget, event):
        Gtk.Application.Quit()

if __name__ == '__main__':
    CalendarTest()
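
The date arithmetic in getcaltext() can be tried on its own without Gtk.  A minimal sketch, assuming the epoch is taken as a naive midnight, January 1, 1970 and that the month arrives zero based the way Gtk.Calendar reports it; dayssinceepoch is a name I made up for this example:

```python
import datetime

# Naive midnight, Jan 1, 1970 (assumed epoch for this sketch).
EPOCH = datetime.datetime(1970, 1, 1)


def dayssinceepoch(year, zerobasedmonth, day):
    """
    Whole days between the epoch and the given date.

    zerobasedmonth is 0 for January through 11 for
    December, matching what Gtk.Calendar reports.
    """
    chosendate = datetime.datetime(year, zerobasedmonth + 1, day)
    return (chosendate - EPOCH).days


# October 31, 2014 - month 9 when zero based.
print(dayssinceepoch(2014, 9, 31))
```

For October 31, 2014 this prints 16374.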


Thanks for stopping by. 

Monday, October 20, 2014

subprocess.Popen() or Abusing a Home-grown Windows Executable

Each month I redo 3D block model interpolations for a series of open pits at a distant mine.  Those of you who follow my twitter feed often see me tweet, "The 3D geologic block model interpolation chuggeth . . ."  What's going on is that I've got all the processing power maxed out dealing with millions of model blocks and thousands of data points.  The machine heats up and with the fan sounds like a DC-9 warming up before flight.

All that said, running everything roughly in parallel is more efficient time-wise than running it sequentially.  An hour of chugging is better than four.  The way I've been doing this is using the Python (2.7) subprocess module's Popen class, running my five interpolated values in parallel.  Our Python programmer Lori originally wrote this to run in sequence for a different set of problems.  I bastardized it for my own.

The subprocess part of the code is relatively straightforward.  Function startprocess() in my code covers that.
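
For anyone who wants to try the Popen pattern without a vendor executable, here is a self-contained sketch.  The stand-in "interactive program" is just a child Python process that reads one line from stdin and echoes it; startchild, CHILDCODE, and the fake paths are assumptions for this example only.  Unlike the real program, the child exits on its own, so wait() takes the place of kill() and no time.sleep() calls are needed.

```python
import os
import subprocess
import sys
import tempfile

# Stand-in for the interactive executable (an assumption
# for this sketch): reads one line on stdin, echoes it.
CHILDCODE = 'import sys; print("got " + sys.stdin.readline().strip())'


def startchild(answer, outputpath):
    """
    Start the child, send it one line on stdin,
    and route its stdout to outputpath.

    Returns subprocess.Popen object, file object.
    """
    outfile = open(outputpath, 'w')
    procx = subprocess.Popen([sys.executable, '-c', CHILDCODE],
                             stdin=subprocess.PIPE, stdout=outfile,
                             universal_newlines=True)
    # Quote the path and end it with the return character.
    procx.stdin.write('"{:s}"\n'.format(answer))
    procx.stdin.flush()
    return procx, outfile


workdir = tempfile.mkdtemp()
running = []
# Start all three before waiting on any of them.
for name in ('ASSAY1', 'ASSAY2', 'ASSAY3'):
    outputpath = os.path.join(workdir, name + 'output.txt')
    procx, outfile = startchild('C:/fake/{:s}.csv'.format(name), outputpath)
    running.append((procx, outfile, outputpath))

results = []
for procx, outfile, outputpath in running:
    procx.stdin.close()
    procx.wait()
    outfile.close()
    with open(outputpath) as f:
        results.append(f.read().strip())

print(results)
```

Sending each child's stdout to its own file is what keeps the output of the parallel runs from getting jumbled together.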

What makes this problem a little more challenging:

1) it's a vendor supplied executable we're dealing with . . . without an API or source . . . that's interactive (you can't feed it the config file path; it asks for it).  This results in a number of time.sleep() and <process>.stdin.write() calls that can be brittle.

2) getting the processes started, as I just mentioned, is easy.  Finding out when to stop, or kill them, requires knowledge of the app and how it generates output.  I've gone for an ugly, but effective check of report file contents.

3) while waiting for the processes to finish their work, I need to know things are working and what's going on.  I've accomplished this by reporting the data files' sizes in MB.

4) the executable isn't designed for a centralized code base (typically all scripts are kept in a folder for the specific project or pit), so it only allows file paths of about 100 characters.  I've omitted the handling of this from my sanitized version of the code, but it made things even messier than they are below.  Also, I don't know if all Windows programs do this, but the paths need to be inside quotes - the path kept breaking on the colon (:) when not quoted.
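
Points 2) and 3) boil down to two small file checks.  A minimal sketch, with a made-up end-of-report line (END REPORT) and temporary files standing in for the real report and data files; filesizemb and reportfinished are hypothetical names for this example:

```python
import os
import tempfile

# Hypothetical end-of-report text for this sketch.
ENDTEXT = 'END REPORT'
# Shifting bytes right by 20 gives whole megabytes.
BITSHIFT = 20


def filesizemb(path):
    """
    Size of file in whole MB, or None if not there yet.
    """
    if os.path.isfile(path):
        return os.stat(path).st_size >> BITSHIFT
    return None


def reportfinished(path, endtext=ENDTEXT):
    """
    True once the report file exists and contains endtext.
    """
    if not os.path.isfile(path):
        return False
    with open(path, 'r') as f:
        return endtext in f.read()


workdir = tempfile.mkdtemp()
rptpath = os.path.join(workdir, 'ASSAY1.RPT')
datpath = os.path.join(workdir, 'ASSAY1K.csv')

# Nothing written yet.
print(reportfinished(rptpath))

# Fake a data file a bit over 3MB and a finished report.
with open(datpath, 'wb') as f:
    f.write(b'0' * (3 * 1024 * 1024 + 17))
with open(rptpath, 'w') as f:
    f.write('block counts . . .\nEND REPORT\n')

print(filesizemb(datpath))
print(reportfinished(rptpath))
```

Reading the whole report file on every check is wasteful, but report files are small next to the data files, and it avoids guessing at the program's state from the outside.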

Basically, this is a fairly ugly problem and a script that requires babysitting while it runs.  That's OK; it beats the alternative (running it sequentially while watching each run).  I've tried to adhere to DRY (don't repeat yourself) as much as possible, but I suspect this could be improved upon.

The reason why I blog it is that I suspect there are other people out there who have to do the same sort of thing with their data.  It doesn't have to be a mining problem.  It can be anything that requires intensive computation across voluminous data with an executable not designed with a Python API.

Notes: 

1) I've omitted the file multirunparameters.py that's in an import statement.  It has a bunch of paths and names that are relevant to my project, but not to the reader's programming needs.

2) python 2.7 is listed at the top of the file as "mpython."  This is the Python that our mine planning vendor ships that ties into their quite capable Python API.  The executable I call with subprocess.Popen() is a Windows executable provided by a consultant independent of the mine planning vendor.  It just makes sense to package this interpolation inside the mine planning vendor's multirun (~ batch file) framework as part of an overall working of the 3D geologic block model.  The script exits as soon as this part of the batch is complete.  I've inserted a 10 second pause at the end just to allow a quick look before it disappears.

#!C:/MineSight/x64/mpython

"""
Interpolate grades with <consultant> program
from text files.
"""


import argparse
import subprocess as subx
import os
import collections as colx

import time
from datetime import datetime as dt


# Lookup file of constants, pit names, assay names, paths, etc.
import multirunparameters as paramsx


parser = argparse.ArgumentParser()
# 4 letter argument like 'kwat'
# Feed in at command line.
parser.add_argument('pit', help='four letter, lower case pit abbreviation (kwat)', type=str)
args = parser.parse_args()
PIT = args.pit


pitdir = paramsx.PATHS[PIT]
pathx = paramsx.BASEPATH.format(pitdir)
controlfilepathx = paramsx.CONTROLFILEPATH.format(pitdir)


timestart = dt.now()
print(timestart)


PROGRAM = 'C:/MSPROJECTS/EOMReconciliation/2014/Multirun/AllPits/consultantprogram.exe'

ENDTEXT = 'END <consultant> REPORT'

# These names are the only real difference between pits.
# Double quote is for subprocess.Popen object's stdin.write method
# - Windows path breaks on colon without quotes.
ASSAY1DRIVER = 'KDriverASSAY1{:s}CBT.csv"'.format(PIT)
ASSAY2DRIVER = 'KDriverASSAY2{:s}CBT.csv"'.format(PIT)
ASSAY3DRIVER = 'KDriverASSAY3_{:s}CBT.csv"'.format(PIT)
ASSAY4DRIVER = 'KDriverASSAY4_{:s}CBT.csv"'.format(PIT)
ASSAY5DRIVER = 'KDriverASSAY5_{:s}CBT.csv"'.format(PIT)


RETCHAR = '\n'

ASSAY1 = 'ASSAY1'
ASSAY2 = 'ASSAY2'
ASSAY3 = 'ASSAY3'
ASSAY4 = 'ASSAY4'
ASSAY5 = 'ASSAY5'


NAME = 'name'
DRFILE = 'driver file'
OUTPUT = 'output'
DATFILE = 'data file'
RPTFILE = 'report file'


# data, report files
ASSAY1K = 'ASSAY1K.csv'
ASSAY1RPT = 'ASSAY1.RPT'

ASSAY2K = 'ASSAY2K.csv'
ASSAY2RPT = 'ASSAY2.RPT'

ASSAY3K = 'ASSAY3K.csv'
ASSAY3RPT = 'ASSAY3.RPT'

ASSAY4K = 'ASSAY4K.csv'
ASSAY4RPT = 'ASSAY4.RPT'

ASSAY5K = 'ASSAY5K.csv'
ASSAY5RPT = 'ASSAY5.RPT'


OUTPUTFMT = '{:s}output.txt'

ASSAYS = {1:{NAME:ASSAY1,
             DRFILE:controlfilepathx + ASSAY1DRIVER,
             OUTPUT:pathx + OUTPUTFMT.format(ASSAY1),
             DATFILE:pathx + ASSAY1K,
             RPTFILE:pathx + ASSAY1RPT},
          2:{NAME:ASSAY2,
             DRFILE:controlfilepathx + ASSAY2DRIVER,
             OUTPUT:pathx + OUTPUTFMT.format(ASSAY2),
             DATFILE:pathx + ASSAY2K,
             RPTFILE:pathx + ASSAY2RPT},
          3:{NAME:ASSAY3,
             DRFILE:controlfilepathx + ASSAY3DRIVER,
             OUTPUT:pathx + OUTPUTFMT.format(ASSAY3),
             DATFILE:pathx + ASSAY3K,
             RPTFILE:pathx + ASSAY3RPT},
          4:{NAME:ASSAY4,
             DRFILE:controlfilepathx + ASSAY4DRIVER,
             OUTPUT:pathx + OUTPUTFMT.format(ASSAY4),
             DATFILE:pathx + ASSAY4K,
             RPTFILE:pathx + ASSAY4RPT},
          5:{NAME:ASSAY5,
             DRFILE:controlfilepathx + ASSAY5DRIVER,
             OUTPUT:pathx + OUTPUTFMT.format(ASSAY5),
             DATFILE:pathx + ASSAY5K,
             RPTFILE:pathx + ASSAY5RPT}}


DELFILE = 'delete file'
INTERP = 'interp'
SLEEP = 'sleep'
MSGDRIVER = 'message driver'
MSGRETCHAR = 'message return character'
FINISHED1 = 'finished one assay'
FINISHEDALL = 'finished all interpolations'
TIMEELAPSED = 'time elapsed'
FILEEXISTS = 'report file exists'
DATSIZE = 'data file size'
DONE = 'number interpolations finished'
DATFILEEXIST = 'data file not yet there'
SIZECHANGE = 'report file changed size'


# for converting to megabyte file size from os.stat()
BITSHIFT = 20

# sleeptime - 5 seconds
SLEEPTIME = 5

FINISHED = 'finished'
RPTFILECHSIZE = """
        
Report file for {:s}
changed size; killing process . . .

"""

MESGS = {DELFILE:'\n\nDeleting {} . . .\n\n',
         INTERP:'\n\nInterpolating {:s} . . .\n\n',
         SLEEP:'\nSleeping 2 seconds . . .\n\n',
         MSGDRIVER:'\n\nWriting driver file name to stdin . . .\n\n',
         MSGRETCHAR:'\n\nWriting retchar to stdin for {:s} . . .\n\n',
         FINISHED1:'\n\nFinished {:s}\n\n',
         FINISHEDALL:'\n\nFinished interpolation.\n\n',
         TIMEELAPSED:'\n\n{:d} elapsed seconds\n\n',
         FILEEXISTS:'\n\nReport file for {:s} exists . . .\n\n',
         DATSIZE:'\n\nData file size for {:s} is now {:d}MB . . .\n\n',
         DONE:'\n\n{:d} out of {:d} assays are finished . . .\n\n',
         DATFILEEXIST:"\n\n{:s} doesn't exist yet . . .\n\n",
         SIZECHANGE:RPTFILECHSIZE}


def cleanslate():
    """
    Delete all output files prior to interpolation
    so that their existence can be tracked.
    """
    for key in ASSAYS:
        files = (ASSAYS[key][DATFILE],
                 ASSAYS[key][RPTFILE],
                 ASSAYS[key][OUTPUT])
        for filex in files:
            print(MESGS[DELFILE].format(filex))
            if os.path.exists(filex) and os.path.isfile(filex):
                os.remove(filex)
    return 0


def startprocess(assay):
    """
    Start <consultant program> run for given interpolation.

    Return subprocess.Popen object,
    file object (output file).
    """
    print(MESGS[INTERP].format(ASSAYS[assay][NAME]))
    # XXX - I hate time.sleep - hack
    # XXX - try to re-route standard output so that
    #       it's not all jumbled together.
    print(MESGS[SLEEP])
    time.sleep(2)
    # output file for stdout
    f = open(ASSAYS[assay][OUTPUT], 'w')
    procx = subx.Popen('{0}'.format(PROGRAM), stdin=subx.PIPE, stdout=f)
    print(MESGS[SLEEP])
    time.sleep(2)
    # XXX - problem, starting up Excel CBT 22JUN2014
    #       Ah - this is what happens when the <software usb licence>
    #            key is not attached :-(
    print(MESGS[MSGDRIVER])
    print('\ndriver file = {:s}\n'.format(ASSAYS[assay][DRFILE]))
    procx.stdin.write(ASSAYS[assay][DRFILE])
    print(MESGS[SLEEP])
    time.sleep(2)
    # XXX - this is so jacked up -
    #       no idea what is happening when
    print(MESGS[MSGRETCHAR].format(ASSAYS[assay][NAME]))
    procx.stdin.write(RETCHAR)
    print(MESGS[SLEEP])
    time.sleep(2)
    print(MESGS[MSGRETCHAR].format(ASSAYS[assay][NAME]))
    procx.stdin.write(RETCHAR)
    print(MESGS[SLEEP])
    time.sleep(2)
    return procx, f


def crosslookup(assay):
    """
    From assay string, get numeric
    key for ASSAYS dictionary.

    Returns integer.
    """
    for key in ASSAYS:
        if assay == ASSAYS[key][NAME]:
            return key
    return 0


def checkprocess(assay, assaydict):
    """
    Check to see if assay
    interpolation is finished.

    assay is the item in question
    (ASSAY1, ASSAY2, etc.).

    assaydict is the operating dictionary
    for the assay in question.

    Returns True if finished.
    """
    # Report file indicates process finished.
    assaykey = crosslookup(assay)
    rptfile = ASSAYS[assaykey][RPTFILE]
    datfile = ASSAYS[assaykey][DATFILE]
    if os.path.exists(datfile) and os.path.isfile(datfile):
        # Report size of file in MB.
        datfilesize = os.stat(datfile).st_size >> BITSHIFT
        print(MESGS[DATSIZE].format(assay, datfilesize))
    else:
        # Doesn't exist yet.
        print(MESGS[DATFILEEXIST].format(datfile))
    if os.path.exists(rptfile) and os.path.isfile(rptfile):
        # XXX - not the most efficient way,
        #       but this checking the file appears
        #       to work best.
        f = open(rptfile, 'r')
        txt = f.read()
        f.close()
        # XXX - hack - gah.
        if txt.find(ENDTEXT) > -1:
            # End text found - the report is complete
            # and the process can be killed.
            print(MESGS[SIZECHANGE].format(assay))
            print(MESGS[SLEEP])
            time.sleep(2)
            return True
    return False


PROCX = 'process'
OUTPUTFILE = 'output file'


# Keeps track of files and progress of <consultant program>.
opdict = colx.OrderedDict()


# get rid of preexisting files
cleanslate()


# start all five roughly in parallel
# ASSAYS keys are numbers
for key in ASSAYS:
    # opdict - ordered with assay names as keys
    namex = ASSAYS[key][NAME]
    opdict[namex] = {}
    assaydict = opdict[namex]
    assaydict[PROCX], assaydict[OUTPUTFILE] = startprocess(key)
    # Initialize active status of process.
    assaydict[FINISHED] = False


# For count.
numassays = len(ASSAYS)
# Loop until all finished.
while True:
    # Cycle until done then break.
    # Sleep SLEEPTIME seconds at a time and check between.
    time.sleep(SLEEPTIME)
    # Count.
    i = 0
    for key in opdict:
        assaydict = opdict[key]
        if not assaydict[FINISHED]:
            status = checkprocess(key, assaydict)
            if status:
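                # kill process once the report file shows the end text
                opdict[key][PROCX].kill()
                # Close the stdout capture file opened in startprocess().
                assaydict[OUTPUTFILE].close()
                assaydict[FINISHED] = True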
                i += 1
        else:
            i += 1
    print(MESGS[DONE].format(i, numassays))
    # all done
    if i == numassays:
        break


print(MESGS[FINISHEDALL])
timeend = dt.now()
elapsed = timeend - timestart


print(MESGS[TIMEELAPSED].format(elapsed.seconds))
print('\n\n{:d} elapsed minutes\n\n'.format(elapsed.seconds // 60))


# Allow quick look at screen.
time.sleep(10)



Sunday, October 12, 2014

Downloading a Bunch of MP3's off the Internet (Foreign Language Tapes)

A mining bud Jen wrote a blog post lamenting the difficulty of learning a foreign language as an adult in a far off land.  This inspired me to clean up and publish the script I wrote for myself to download the Foreign Service Institute French "tapes" (mp3's, actually).

I'm not very astute on web programming.  This script came out of necessity.  There may be other, more efficient ways to do this.  If you have a slow connection a piecemeal approach will probably be required.  It took about 20 minutes to get all these files over a decent Verizon MIFI unit connection (I, unfortunately, don't have speed metrics available).

Notes about the downloaded product:  the US State Department's language tapes and lessons were mostly written and produced 30 to 50 years ago.  It's not Rosetta Stone, but I have found them to have value when it comes to practicing pronunciation, including cadence and rhythm of the foreign language - things you just can't get from printed or displayed text.

My late wife gifted me some Spanish tapes prior to the internet age that helped me out.  I am by no means fluent in Spanish, but I can say Hacemos lo que podemos hasta que nos boten (this may not be entirely grammatically correct) to the Spanish speaking mining engineers and get a laugh.




The original names of the mp3's are unnecessarily long and have the appearance of having been created by the Department of Redundancy Department.  It's a government thing, but it does not reflect on the quality of the product.  While the tapes at times are sociologically and technologically dated in their subject matter, the foreign languages haven't changed all that much.



The script:  I used Python 3.4 with the urlretrieve function from the urllib.request module.  The main challenge was getting the url's of the mp3's right.  The names are not entirely consistent.  For help with this (I am using Firefox 24.3.0 on OpenBSD 5.4), I right clicked on the mp3's link and selected Inspect Element from the drop down menu:



The lower left window has the href and the link to the mp3 - if your script is not able to find the file, this is a convenient place to look.
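
The naming inconsistency mostly comes down to zero padding:  through unit 18 the section number is padded to two digits, and from unit 19 on it is not.  Here is a rough sketch of just that part, with the patterns simplified from the script's URLI and URLII (the names PADDED, UNPADDED, and tailname are made up for this illustration):

```python
# Simplified tails of the two url patterns (illustration only).
PADDED = 'Unit {unit:0>2d} {unit:0>2d}.{section:0>2d}.mp3'
UNPADDED = 'Unit {0[unit]:0>2d} {0[unit]:0>2d}.{0[section]:d}.mp3'


def tailname(unit, section):
    """Return the end of the mp3 name for a unit and section."""
    fields = {'unit': unit, 'section': section}
    if unit > 18:
        # Dictionary passed positionally, looked up with [key].
        return UNPADDED.format(fields)
    # Dictionary unpacked into keyword arguments.
    return PADDED.format(**fields)


print(tailname(3, 4))
print(tailname(19, 4))
```

The first call prints Unit 03 03.04.mp3 and the second Unit 19 19.4.mp3.  If a download fails, comparing the generated name against the href in the inspector is the quickest check.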

This is the whole thing:


#!python3.4

from urllib import request

# For getting foreign language study mp3's.
# Main part of URL for French.
BASEURL = 'http://www.fsi-language-courses.org/Courses/'
MIDDLEURLI = 'French/Basic (Revised)/Volume {volume}/'
MIDDLEURLII = 'French/Basic (Revised)/Volume {0:s}/'
BASEURLEND = 'FSI - French Basic Course (Revised) '

# Format changes inexplicably at chapter 19.
# Grrrr . . .
URLI = BASEURL + MIDDLEURLI + BASEURLEND
URLI += '- Volume {volume} - Unit {unit:0>2d} '
URLI += '{unit:0>2d}.{section:0>2d}.mp3'

URLII = BASEURL + MIDDLEURLII + BASEURLEND
URLII += '- Volume {1[volume]:d} - Unit {1[unit]:0>2d} '
URLII += '{1[unit]:0>2d}.{1[section]:d}.mp3'

# Format for actual name of mp3 files.
# This is what I wanted for a name - your
# preferences may be different - adjust
# accordingly.
FILENAME = '{unit:0>2d}{section:0>2d}.mp3'

# Texts (pdf format).
# Everything the State Dept. does is a 'StudentText' -
# fair enough.
STUDENTTXT = 'StudentText.pdf'

PDFURLBASICTEXT1 = 'http://ia601400.us.archive.org/28/items/'
PDFURLBASICTEXT1 += 'Fsi-FrenchBasicCourserevised-StudentText/'
PDFURLBASICTEXT1 += 'Fsi-FrenchBasicCourserevised-Volume1-'

PDFURLBASICTEXT2 = 'http://ia801400.us.archive.org/28/items/'
PDFURLBASICTEXT2 += 'Fsi-FrenchBasicCourserevised-StudentText/'
PDFURLBASICTEXT2 += 'Fsi-FrenchBasicCourserevised-Volume2-'

PDFURLMONDEFR = 'http://ia600406.us.archive.org/3/items/'
PDFURLMONDEFR += 'Fsi-LeMondeFrancophone/Fsi-LeMondeFrancophone-'

TWO = 'Two'

# Tack on StudentText.pdf to end.
pdfs = [PDFURLBASICTEXT1, PDFURLBASICTEXT2, PDFURLMONDEFR]
pdfs = [pdfx + STUDENTTXT for pdfx in pdfs]
myfilenames = ['basictext1.pdf', 'basictext2.pdf', 'mondefrancophone.pdf']
# I'm using the dictionary keys for filenames.
pdfs = dict(zip(myfilenames, pdfs))

VOLUME = 'volume'
UNIT = 'unit'
SECTION = 'section'

# volume key, then list of two tuples of unit and
# number of sections
VOLUMES = {1:[(1, 6), (2, 6), (3, 6), (4, 7), (5, 7),
              (6, 3), (7, 11), (8, 10), (9, 11), (10, 9),
              (11, 9), (12, 4)],
           2:[(13, 8), (14, 9), (15, 10), (16, 9), (17, 11),
              (18, 7), (19, 9), (20, 8), (21, 8), (22, 7),
              (23, 8), (24, 6)]}

mp3s = []
for key in VOLUMES:
    for unitsection in VOLUMES[key]:
        for x in range(1, unitsection[1] + 1):
            mp3s.append({VOLUME:key, UNIT:unitsection[0], SECTION:x})

for mp3x in mp3s:
    # Name format change at chapter 19 :-(
    if mp3x[UNIT] > 18:
        urlx = URLII.format(TWO, mp3x)
    else:
        urlx = URLI.format(**mp3x)
    filenamex = FILENAME.format(**mp3x)
    print('Retrieving {0} . . .'.format(urlx))
    request.urlretrieve(urlx, filenamex)

# Add pdf texts at end.
for pdfx in pdfs:
    print('Retrieving {0} . . .'.format(pdfx))
    request.urlretrieve(pdfs[pdfx], pdfx)

print('Everything appears to have downloaded.')
print('Check the directory with the files to be sure.')
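
One thing to watch:  the url's contain spaces and parentheses.  The script passes them as is, which worked for me, but if a server balks at them, percent-encoding the path is one possible fix.  This is a minimal sketch, not something the script above does; quoteurl is a helper name I made up, and example.mp3 is a placeholder file name.

```python
from urllib import parse


def quoteurl(url):
    """Percent-encode the path of a url, leaving scheme and host alone."""
    parts = parse.urlsplit(url)
    return parse.urlunsplit(parts._replace(path=parse.quote(parts.path)))


# Placeholder file name on the end - not a real mp3 on the site.
example = ('http://www.fsi-language-courses.org/Courses/'
           'French/Basic (Revised)/Volume 1/example.mp3')
print(quoteurl(example))
```

The result could then be handed to request.urlretrieve in place of the raw url.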
 
As for my French efforts, I've had better luck downloading this stuff than I have learning it.  Nonetheless, a quick message to Guido van Rossum and the other core devs:  transmettez-leur mon meilleur souvenir.

Monday, October 6, 2014

Event report: pycon.za

I managed to squeeze in a 4 day stop in Johannesburg on a recent trip that happily coincided with pycon.za.  I love pycon.us and all the other big conferences, but for value, these smaller localized cons can't be beat.

Venue:  The Campus, Bryanston

Not your average office park.  It's nicely landscaped and has a huge center beach or pitch or lawn (depending where you're from).  The buildings are all named after famous sports venues like Lemans.  The nod to us Yanks (NOT New York Yankees) in Wrigley Field was a nice touch.





Best of all, 100MB/day of internet for all who enter.  That's not ideal if you're wanting to watch YouTube videos, but plenty if you just want to check a speaker bio or do con-related stuff.  I thought the organizers did a great job of keeping the con inexpensive but valuable.

The catered food and drinks were really good, by my standards at least.

Apart from an unfortunate plumbing problem in the men's bathroom the second day that was quickly repaired, everything went off without a hitch.

Talks that I went to:

Ludell-Doughtie Writing Python Code to Decide an Election Keynote - he outlined the methodology and process they used during a recent (Libyan? - there was Arabic right-to-left text in the data) election.

The main take-aways for me were
  1. Use pre-written, open source software packages to standardize things, because you won't have time to roll your own or dink with inconsistent data/code formats when you are in the thick of it.
  2. It's a huge responsibility to write code for an election and manage the data, but it's a cool project.

Steve Crawford Enabling Science with the Southern African Large Telescope with Python - Doctor Crawford didn't show a lot of code in this talk, but he did outline the architecture for getting information and moving it around. The scope of the talk was way too big for code samples, but that's OK. I left feeling . . . shall we say . . . inspired . . .

My main takeaways:

Astronomy is wickedly cool and based on instrumentation, precision, and data paucity and, ironically, an overabundance of data (on average about 10GB/day, up to 50GB/day). Crawford mentioned more than once the desperate need to "catch as many photons as possible because there are so few coming in." Yeah, photons, like particles of light, just wow.

Python is used for everything where it is appropriate to use it. There are plenty of problems that don't require you to be a genius rocket scientist like Crawford:  sysadmin, data, and, perhaps most importantly, web. They're using MySQL and a web frontend to distribute data throughout the world on a daily basis to other astronomers who need it. I'm always biased toward raw data myself; it is critical, but if you can't distribute it, it's not worth much.

Good talk for me to attend.

Albert Nel - Using Python in Blender - Nel is a total joker (in a respectful, entertaining, good way), but not enough of a joker to belie a serious love and enthusiasm for both Python and Blender.

My own experience with rendering 3D stuff is a little dinking around with POV-ray.  Blender is different in that it's big on animations and honoring the laws of physics.  Writing Python to automate Blender is similar to, for lack of a better analogy, writing or recording VBA macros in Excel.

Nel did a lotto ball live demo and a Lego movie ocean demo (aside:  I *LOVE* live demos, even when they go wrong - it's one of the best parts of Open Source conferences versus say, a godawful boring company PowerPoint presentation - thank you to the Nelster for accommodating us).

My takeaways:

Blender is fun.

Allison Randal The Earth is not Flat (and Other Heresies) Keynote - a lot of times I don't relate a lot to keynotes because it's about super high level programmer craft stuff (disclaimer:  I've worked as a dev, but I'm a geologist by trade) that I can't really control or understand.

So my mind wandered as Randal gracefully moved about the stage in her pixie frame and calmly laid down her knowledge.  As a much younger man I would have been thinking, "She's so smart . . . and a very attractive individual to boot . . ."  As a curmudgeonly old fart my thoughts go more towards "Damn - she's in perfect shape, speaks well, and knows what the hell she's talking about.  I'm SOOO jealous; why can't I be like that?"  In all seriousness, what always blows me away when I see Randal talk is the calm, matter of fact way she just presents facts and opinions without any malice or belligerence.

At one point she responded to a question by saying essentially, "Don't use AWS; use OpenStack <if you want to accomplish X>."  Amazon was one of the three top corporate sponsors of the event, but it wasn't a SPEAK TRUTH TO POWER/VIVE LA REVOLUCION kind of thing, just a "this is what I think based on what I know."

I'm glad she's with "us" (the open source community) instead of selling her soul to the commercial world (which she could do at great profit).

Takeaway (tongue-in-cheek) - my view of me vis a vis Allison Randal (I'm the guy on the right).

They say "kill your heroes."  Until I drop 40 lbs. and learn to express my ideas in a less conflict ridden manner, I am not ready to kill anything.  Sorry, Ms. Randal.  I hope this isn't too creepy, but you're going to remain the queen on my hero pedestal for a while :-\

Dr. David Mertz What I Learned About Python - and About Guido's Time Machine - from Reading the Python-Ideas Mailing List Keynote - David took an example of an idea for a sum function for lists and walked through all the considerations of sanity, performance, implementation, and ultimate rejection.

My takeaways:

  1. The idea has to be intuitive and make sense (he actually experimented with this sociologically - that was kind of cool).
  2. The implementation has to be consistent.
  3. Performance matters (a lot).
  4. 1 trumps 2 and 3.

Adrianna Pińska An Introduction to Regular Expressions in Python - Don't let the name fool you; this Polish lady speaks the Queen's English quite well.  She apologized (sort of) ahead of time saying she would talk too fast, but, really, the talk was paced just right.  I was really happy having gone to it.

My takeaways (for regex):

  1. Start with very general matches (.* for example) and work towards specific matches to gain skill and confidence.
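
A small sketch of what that progression might look like with Python's re module.  The sample text and the three patterns are my own made-up illustration, not something taken from the talk:

```python
import re

text = '2014-10-06 Event report'

# Step 1: very general - confirm something matches at all.
general = re.search(r'.*', text)
# Step 2: narrower - capture the leading digits, leave the rest loose.
narrower = re.search(r'(\d+)-.*', text)
# Step 3: specific - spell out the whole date.
specific = re.search(r'(\d{4})-(\d{2})-(\d{2})', text)

print(general.group(0))
print(narrower.group(1))
print(specific.groups())
```

Each step tightens the pattern only after the looser one is confirmed to match, which makes it easier to see which piece breaks when a match fails.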

Ridhwana Khan A Journey Through the Eyes of a Newbie Female Developer - Very positive, professional talk, especially for a youngster.

(Aside: it's none of my business, but I think Ms. Khan is Muslim - she wore this really cool black-red combination outfit with a red head scarf - I borked my picture with my point and shoot camera, but I think a video of the talk is online.  Anyway, for a diversity-oriented talk, the outfit was not only cool and classy, but perfect for a South African con).

Ridhwana's talk was well structured with some humor interjected.  She started out with the most important point - that she loves coding and wants to do this for a career.  There were a number of valid points and ideas put forward - it's worth checking it out online.

My main takeaway:  IIRC not once did Ridhwana mention a Code of Conduct policy nor did she dwell on personal experiences with harassment.  Essentially, she has had a pretty good experience with colleagues thus far.  After a year with an all male crew (her excepted), she learned that prior to her arrival, firm rules had been established regarding off-color humor (basically banned) and such.  For me, this is a pretty good example of how some firm (but not excessively draconian) rules can help make programmer-land a women friendly place.  Ridhwana's point was that (at least in South African society) this is typically how relationships go anyway.  You meet someone, then after some time you get to know them better, and at that time, you can loosen up a bit more as appropriate.

Hallway track:  there were fewer than 150 people at this con IIRC, so if you wanted to talk to anyone, there was time.  People involved with the new kilometer array telescope project, people involved with the older telescopes northeast of Cape Town, speakers, Dr. Mertz, Allison Randal, a PhD in computational mathematics who specializes in computer vision, South African devs, the organizers of the conference - where else could a grunt open pit mine geologist like me have access to such luminosity?  pycon.za is pretty sweet.

Monday, September 1, 2014

PDF - Removing Pages and Inserting Nested Bookmarks

I blogged before about PyPDF2 and some initial work I had done in response to a request to get a report from Microsoft SQL Server Reporting Services into PDF format.  Since then I've had better luck with PyPDF2 using it with Python 3.4.  Seldom do I need to make any adjustments to either the PDF file or my Python code to get things to work.

Presented below is the code that is working for me now.  The basic gist of it is to strip the blank pages (conveniently SSRS dumps the report with a blank page every other page) from the SSRS PDF dump and reinsert the bookmarks in the right places in a new final document.  The report I'm doing is about 30 pages, so having bookmarks is pretty critical for presentation and usability.
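
The page bookkeeping is simple enough to sketch without PyPDF2 at all.  This assumes, as in my SSRS dump, that the blank pages always fall on the odd zero-based indices; keptpages is a name made up for this illustration.

```python
def keptpages(numpages):
    """
    Pair each kept page index in the old document
    with its index in the new, smaller document.
    """
    return [(old, old // 2) for old in range(0, numpages, 2)]


print(keptpages(6))
```

For a six page dump this gives [(0, 0), (2, 1), (4, 2)]:  old pages 0, 2, and 4 become pages 0, 1, and 2 of the final report.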

The approach I took was to get the bookmarks out of the PDF object model and into a nested dictionary that I could understand and work with easily.  To keep the bookmarks in the right order for presentation I used collections.OrderedDict instead of just a regular Python dictionary structure.  The code should work for any depth level of nested parent-child PDF bookmarks.  My report only goes three or four levels deep, but things can get fairly complex even at that level.
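
To show the shape of that nested dictionary, here is a stripped down sketch that uses plain (title, page) tuples as stand-ins for PyPDF2 bookmark objects.  The function names nest and walk and the sample outline are made up for illustration, and the sketch assumes a list of children always follows its parent bookmark.

```python
from collections import OrderedDict

NAME = 'name'
CHILDREN = 'children'


def nest(outline):
    """Turn a bookmark, list, bookmark, list sequence into nested dictionaries."""
    result = OrderedDict()
    lastpage = None
    for item in outline:
        if isinstance(item, list):
            # A list holds the children of the bookmark just before it.
            result[lastpage][CHILDREN] = nest(item)
        else:
            title, page = item
            result[page] = {NAME: title, CHILDREN: OrderedDict()}
            lastpage = page
    return result


def walk(tree, depth=0):
    """Yield (depth, page, name) for every bookmark, parents first."""
    for page in tree:
        yield depth, page, tree[page][NAME]
        for entry in walk(tree[page][CHILDREN], depth + 1):
            yield entry


# Made-up outline: (title, page) tuples stand in for Destination objects.
outline = [('Summary', 0),
           [('Method A', 1), ('Method B', 2)],
           ('Appendix', 3)]
tree = nest(outline)
for depth, page, name in walk(tree):
    print(4 * depth * ' ' + '{0} (page {1:d})'.format(name, page))
```

Keying each level by page number is what limits the structure to one bookmark per page per level, and the OrderedDict keeps siblings in document order when walking the tree.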

There are a couple artifacts of the actual report I'm doing - the name "comparisonreader" refers to the subject of the report, a comparison of accounting methods' results.  I've tried to sanitize the code where appropriate, but missed a thing or two.

It may be a bit overwrought (too much code), but it gets the job done.  Thanks for having a look.

#!C:\python34\python

"""
Strip out blank pages and keep bookmarks for
SQL Server SSRS dump of model comparison report (pdf).
"""


import PyPDF2 as pdf
import math
from collections import OrderedDict

INPUTFILE = 'SSRSdump.pdf'
OUTPUTFILE = 'Finalreport.pdf'

OBJECTKEY = '/A'
LISTKEY = '/D'


# Adobe PDF document element keys.
FULLPAGE = '/Fit'
PAGE = '/Page'
PAGES = '/Pages'
ROOT = '/Root'
KIDS = '/Kids'
TITLE = '/Title'


# Python/PDF library types.
NODE = pdf.generic.Destination
CHILD = list


ADDPAGE = 'Adding page {0:d} from SSRS dump to page {1:d} of new document . . .'

# dictionary keys
NAME = 'name'
CHILDREN = 'children'


INDENT = 4 * ' '

ADDEDBOOKMARK = 'Added bookmark {0:s} to parent bookmark {1:s} at depthlevel {2:d}.'

TOPLEVEL = 'TOPLEVEL'

def getpages(comparisonreader):
    """
    From a PDF reader object, gets the
    page numbers of the odd numbered pages
    in the old document (SSRS dump) and
    the corresponding page in the final
    document.

    Returns a generator of two tuples.
    """
    # get number of pages then get odd numbered pages
    # (even numbered indices)
    numpages = comparisonreader.getNumPages()
    return ((x, int(x/2)) for x in range(numpages) if x % 2 == 0)


def fixbookmark(bookmark):
    """
    bookmark is a PyPDF2 bookmark object.

    Side effect function that changes bookmark
    page display mode to full page.
    """
    # getObject yields a dictionary
    bookmark.getObject()[OBJECTKEY][LISTKEY][1] = pdf.generic.NameObject(FULLPAGE)
    return 0


def matchpage(page, pages):
    """
    Find index of page match.

    page is a PyPDF2 page object.
    pages is the list (PyPDF2 array) of page objects.
    Returns integer page index in new (smaller) doc.
    """
    originalpageidx = pages.index(page)
    return math.floor((originalpageidx + 1)/2)


def pagedict(bookmark, pages):
    """
    Creates page dictionary for PyPDF2 bookmark object.

    bookmark is a PDF object (dictionary).
    pages is a list of PDF page objects (dictionary).
    Returns two tuple of a dictionary and
    integer page number.
    """
    page = matchpage(bookmark[PAGE].getObject(), pages)
    title = bookmark[TITLE]
    # One bookmark per page per level.
    lookupdict = OrderedDict()
    lookupdict.update({page:{NAME:title,
                             CHILDREN:OrderedDict()}})
    return lookupdict, page


def recursivepopulater(bookmark, pages):
    """
    Fills in child nodes of bookmarks
    recursively and returns dictionary.
    """
    dictx = OrderedDict()
    for pagex in bookmark:
        if type(pagex) is NODE:
            # get page info and update dictionary with it
            lookupdict, page = pagedict(pagex, pages)
            dictx.update(lookupdict)
        elif type(pagex) is CHILD:
            newdict = OrderedDict()
            newdict.update(recursivepopulater(pagex, pages))
            dictx[page][CHILDREN].update(newdict)
    return dictx


def makenewbookmarks(pages, bookmarks):
    """
    Main function to generate bookmark dictionary:

    {page number: {name:<name>,
                   children:[<more bookmarks>]},
                   and so on.

    Returns dictionary.
    """
    dictx = OrderedDict()
    # top level bookmarks
    # it's going to go bookmark, list, bookmark, list, etc.
    for bookmark in bookmarks:
        if type(bookmark) is NODE:
            # get page info and update dictionary with it
            lookupdict, page = pagedict(bookmark, pages)
            dictx.update(lookupdict)
        elif type(bookmark) is CHILD:
            dictx[page][CHILDREN] = recursivepopulater(bookmark, pages)
    return dictx


def printbookmarkaddition(name, parentname, depthlevel):
    """
    Print notification of bookmark addition.

    Indentation based on integer depthlevel.
    name is the string name of the bookmark.
    parentname is the string name of the parent
    bookmark.

    Side effect function.
    """
    args = name, parentname, depthlevel
    indent = depthlevel * INDENT
    print(indent + ADDEDBOOKMARK.format(*args))


def dealwithbookmarks(comparisonreader, output, bookmarkdict, depthlevel, levelparent=None, parentname=None):
    """
    Fix bookmarks so that they are properly
    placed in the new document with the blank
    pages removed. Recursive side effect function.

    comparisonreader is the PDF reader object
    for the original document.


    output is the PDF writer object for the
    final document.


    bookmarkdict is a dictionary of bookmarks.

    depthlevel is the depth inside the nested
    dictionary-list structure (0 is the top).


    levelparent is the parent bookmark.

    parentname is the name of the parent bookmark.
    """
    depthlevel += 1
    for pagekeylevel in bookmarkdict:
        namelevel = bookmarkdict[pagekeylevel][NAME]
        levelparentii = output.addBookmark(namelevel, pagekeylevel, levelparent)
        if depthlevel == 0:
            parentname = TOPLEVEL
        printbookmarkaddition(namelevel, parentname, depthlevel)
        fixbookmark(levelparentii)
        # dictionary
        secondlevel = bookmarkdict[pagekeylevel][CHILDREN]
        argsx = comparisonreader, output, secondlevel, depthlevel, levelparentii, namelevel
        # Recursive call.
        dealwithbookmarks(*argsx)


def cullpages():
    """
    Fix SSRS PDF dump by removing blank
    pages.
    """
    ssrsdump = open(INPUTFILE, 'rb')
    finalreport = open(OUTPUTFILE, 'wb')
    comparisonreader = pdf.PdfFileReader(ssrsdump)
    pageindices = getpages(comparisonreader)
    output = pdf.PdfFileWriter()
    # add pages from SSRS dump to new pdf doc
    for (old, new) in pageindices:
        print(ADDPAGE.format(old, new))
        pagex = comparisonreader.getPage(old)
        output.addPage(pagex)

    # Attempt to add bookmarks from original doc
    # getOutlines yields a list of nested dictionaries and lists:
    #    outermost list - starts with parent bookmark (dictionary)
    #        inner list - starts with child bookmark (dictionary)       
    #                     and so on
    # The SSRS dump and this list have bookmarks in correct order.
    bookmarks = comparisonreader.getOutlines()
    # Get page numbers using this methodology (indirect object references)
    # http://stackoverflow.com/questions/1918420/split-a-pdf-based-on-outline
    # list of IndirectObject's of pages in order
    pages = [pagen.getObject() for pagen in
            comparisonreader.trailer[ROOT].getObject()[PAGES].getObject()[KIDS]]
    # Bookmarks.
    # Top level is list of bookmarks.
    # List goes parent bookmark (Destination object)
    #               child bookmarks (list)
    #                   and so on.
    bookmarkdict = makenewbookmarks(pages, bookmarks)
    # Initial level of -1 allows increment to 0 at start.
    dealwithbookmarks(comparisonreader, output, bookmarkdict, -1)

    print('\n\nWriting final report . . .')
    output.write(finalreport)
    finalreport.close()
    ssrsdump.close()
    print('\n\nFinished.\n\n')


if __name__ == '__main__':
    cullpages()