Friday, March 28, 2014

Editing a PDF file with Python (with a little help from PDFTKBuilder)

I'm working with a report published with SQL Server Reporting Services (SSRS).  The report is located on a remote server in Africa.  It is inconvenient for management in North America to view the report and print it from a browser (slow connection, formatting issues).  Instead, management would like a PDF file of the report to be e-mailed out to a distribution list.

This post deals with taking the PDF dumped from the SSRS web report and cleaning it up for viewing and navigation (bookmarking).  I didn't know a great deal about PDFs before working on this.  My ignorance will probably be reflected in the terminology I use and in my approach.  Nonetheless, the problem was a bit more involved than I anticipated.  My intent is to put my experience out there and, if I have made things harder than necessary, get some feedback in the comments.

I think it's fair to say that SSRS is not yet a mature product, but in a Microsoft/Windows environment its usefulness outweighs its rough edges.  The dump-to-PDF or Excel feature for reports is handy, but doesn't always yield output consistent with the SSRS web report.  The first problem I had was a "corrupt" PDF dump.  The file opens fine in Acrobat Reader, but doesn't behave well when one attempts to copy its contents, with or without modifications, to another file with PyPDF2 (what follows is just a straight copy of pages from one PDF file to another, new one):

Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import PyPDF2 as pdf
>>> dumpfile = open('baddumpfromssrs.pdf', 'rb')
>>> reader = pdf.PdfFileReader(dumpfile)
>>> numpages = reader.getNumPages()
>>> numpages
54
>>> outputfile = open('testoutput.pdf', 'wb')
>>> writer = pdf.PdfFileWriter()
>>> for x in xrange(numpages):
...     writer.addPage(reader.getPage(x))
...
>>> writer.write(outputfile)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 279, in write
    self._sweepIndirectReferences(externalReferenceMap, self._root)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 367, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 343, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 367, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 343, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 352, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, data[i])
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 367, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 343, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 343, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 343, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 381, in _sweepIndirectReferences
    newobj = self._sweepIndirectReferences(externMap, newobj)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 343, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 372, in _sweepIndirectReferences
    newobj = data.pdf.getObject(data)
  File "c:\python27\lib\site-packages\PyPDF2\pdf.py", line 1164, in getObject
    retval = readObject(self.stream, self)
  File "c:\python27\lib\site-packages\PyPDF2\generic.py", line 71, in readObject
    return DictionaryObject.readFromStream(stream, pdf)
  File "c:\python27\lib\site-packages\PyPDF2\generic.py", line 587, in readFromStream
    value = readObject(stream, pdf)
  File "c:\python27\lib\site-packages\PyPDF2\generic.py", line 91, in readObject
    return NumberObject.readFromStream(stream)
  File "c:\python27\lib\site-packages\PyPDF2\generic.py", line 257, in readFromStream
    return NumberObject(num)
ValueError: invalid literal for int() with base 10: ''
>>>


Bummer.  I had to google around until something showed up on Stack Overflow.  A comment on the post suggests using pdftk to "un-corrupt" the file.  To avoid needing an admin to install something on my work computer, I downloaded PDFTKBuilder Portable.  This is really overkill, because I didn't need the user interface to clean up the file.  There is an App folder in the pdftk portable install that has the pdftk command line tool:

C:\blogdoc>pdftk baddumpfromssrs.pdf output good.pdf

This "worked" in terms of preparing the file to be handled with Python and PyPDF2, but not before I had opened it in Adobe Reader and closed it again.  I'm working in a corporate environment under Windows 7.  I double-checked that I was issuing the same command in the command window (I used the up arrow key to recall the command that eventually worked).  I don't know what's going on there.  The important thing was that I could proceed with the non-corrupt file good.pdf.
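
Since the end goal is an automated e-mail job, here is a minimal sketch of scripting that repair step from Python rather than typing it at the prompt (this assumes pdftk.exe is on the PATH, or that you substitute the full path to the App folder of the portable install):

import subprocess

# same command as above; check_call raises CalledProcessError if pdftk fails
subprocess.check_call(['pdftk', 'baddumpfromssrs.pdf', 'output', 'good.pdf'])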

While I'm on things that weren't working, I should probably mention the Python 3/Python 2 thing.  This originally gave an error on Python 3; when I tried to reproduce the problem, it hung forever and I had to kill it with Ctrl-C:

Python 3.4.0 (v3.4.0:04f714765c13, Mar 16 2014, 19:24:06) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import PyPDF2 as pdf
>>> inputfile = open('good.pdf', 'rb')
>>> reader = pdf.PdfFileReader(inputfile)
>>> pagex = reader.getPage(0)
>>> pagex.extractText()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python34\lib\site-packages\PyPDF2\pdf.py", line 2070, in extractText
    content = ContentStream(content, self.pdf)
  File "c:\python34\lib\site-packages\PyPDF2\pdf.py", line 2153, in __init__
    self.__parseContentStream(stream)
  File "c:\python34\lib\site-packages\PyPDF2\pdf.py", line 2173, in __parseContentStream
    operator += tok
KeyboardInterrupt


I'm a bit of a Python 3 advocate, sometimes even a zealot.  Still, pain won out over conviction and I switched to Python 2.7, where I got better results.  This problem (with the error) comes up on Stack Overflow.  A comment there suggests replacing a couple of PyPDF2 source files to make sure it runs with Python 3.  I couldn't find the link and took the expedient Python 2.7 route.

This is about where everything started to work the way it was supposed to.  Now I could get down to fixing the SSRS pdf report dump.  The first thing that needed to happen was the removal of a bunch of blank pages from the report.  On the SSRS web report they weren't there, but the PDF file had a blank page everywhere there was a page break.  Conveniently, this was every other page:

Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import PyPDF2 as pdf
>>> inputfile = open('good.pdf', 'rb')
>>> reader = pdf.PdfFileReader(inputfile)
>>> numpages = reader.getNumPages()
>>> numpages
54
>>> contentpages = (x for x in xrange(numpages) if x % 2 == 0)
>>> outputfile = open('testoutput.pdf', 'wb')
>>> writer = pdf.PdfFileWriter()
>>> for n in contentpages:
...     pagex = reader.getPage(n)
...     writer.addPage(pagex)
...
>>> writer.write(outputfile)
>>> outputfile.close()
>>> inputfile.close()
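
As an aside, if the blank pages hadn't conveniently alternated, a rough sketch of a more general approach (not what I actually ran; file names are illustrative) would be to keep only the pages whose extracted text is non-empty:

import PyPDF2 as pdf

inputfile = open('good.pdf', 'rb')
outputfile = open('contentonly.pdf', 'wb')
reader = pdf.PdfFileReader(inputfile)
writer = pdf.PdfFileWriter()
for n in range(reader.getNumPages()):
    pagen = reader.getPage(n)
    # extractText() is imperfect, but an empty result is a decent hint of a blank page
    if pagen.extractText().strip():
        writer.addPage(pagen)
writer.write(outputfile)
outputfile.close()
inputfile.close()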


Super!  Now I've got a 27 page document with content on every page.  The next thing I needed in my case was a banner or mark across each page saying, "DRAFT FORMAT."  The specific idea was that this was a sample report being circulated for comments and approval.

I didn't want super bold red text across the page; rather, white text outlined in red.  Some googling paid off with a suggestion from a mailing list.  reportlab.pdfgen is the tool used for creating the banner file.  We'll merge it onto the pages of the main report document later.

Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from reportlab.pdfgen import canvas as canx
>>> c = canx.Canvas('banner.pdf')
>>> # want red outline
...
>>> c.setStrokeColor((1, 0, 0))
>>> # inside of letters should be white
...
>>> c.setFillColor((1, 1, 1))
>>> c.setLineWidth(1.0)
>>> t = c.beginText()
>>> t.setTextRenderMode(2)
>>> c._code.append(t.getCode())
>>> c.setFont('Helvetica', 48)
>>> # origin is at bottom, left of page
...
>>> c.drawString(2 * 72, 7 * 72, 'DRAFT FORMAT')
>>> c.save()
>>>
>>>


Great, I've got a banner.


There is a whole bunch of stuff in that code segment that I'm leaving unexplained.  Not a big surprise, but to use a Python API to edit PDFs, you need to know something about the format.  This has been a huge learning experience over the course of a day or two for me.  What helped me most is the reportlab documentation.  After copying a code snippet and seeing that it worked, I could go back there and try to figure out how it works.  This learning experience is a work in progress.  There are things you pick up right away, though.  For instance, the PDF standard defines 14 base fonts that every reader supports, of which Helvetica is one.  Who knew?  Not I!

My banner isn't quite the way I want it.  It's horizontal and I would like to tilt it to 45 degrees.  Google again to the rescue.  Some kind soul has already covered it on a blog.

Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from reportlab.pdfgen import canvas as canx
>>> c = canx.Canvas('banner.pdf')
>>> c.setStrokeColor((1, 0, 0))
>>> c.setFillColor((1, 1, 1))
>>> c.setLineWidth(1.0)
>>> t = c.beginText()
>>> t.setTextRenderMode(2)
>>> c._code.append(t.getCode())
>>> c.setFont('Helvetica', 48)
>>> c.saveState()
>>> c.translate(100, 100)
>>> c.rotate(45)
>>> c.drawCentredString(500, 100, 'DRAFT FORMAT')
>>> c.save()
>>>


Close enough.  Confession - I don't really think things through and measure with trigonometry what it will take to get placement right; I just "hack" until it looks about right.  This is a habit I should break if I continue to have to play with PDFs.

Now we'll merge the banner with some pages from another pdf to make a new document.  I'm going to use pages from the reportlab documentation because there's all kinds of work stuff in the pdf I generated above.

Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import PyPDF2 as pdf
>>> bannerfile = open('banner.pdf', 'rb')
>>> docfile = open('docfile.pdf', 'rb')
>>> outputfile = open('newfile.pdf', 'wb')
>>> readerbanner = pdf.PdfFileReader(bannerfile)
>>> readerdoc = pdf.PdfFileReader(docfile)
>>> writernewdoc = pdf.PdfFileWriter()
>>> pagesdoc = (readerdoc.getPage(x) for x in xrange(286, 291))
>>> for pagen in pagesdoc:
...     writernewdoc.addPage(pagen)
...
>>> writernewdoc.write(outputfile)
>>> outputfile.close()
>>> docfile.close()
>>> # now merge banner to pages of new file
...
>>> opaquebannerfile = open('opaquebannerfile.pdf', 'wb')
>>> testpagefile = open('newfile.pdf', 'rb')
>>> bannerpage = readerbanner.getPage(0)
>>> readertestpages = pdf.PdfFileReader(testpagefile)
>>> writeropaquebanner = pdf.PdfFileWriter()
>>> for x in xrange(readertestpages.getNumPages()):
...     pagex = readertestpages.getPage(x)
...     pagex.mergePage(bannerpage)
...     writeropaquebanner.addPage(pagex)
...
>>> writeropaquebanner.write(opaquebannerfile)
>>> opaquebannerfile.close()
>>> bannerfile.close()
>>> testpagefile.close()
>>>


It's not perfect, but it's essentially what I wanted (my centering of the banner could be better).



What if I wanted a transparent banner to emphasize the draft nature of the content rather than that of the format?

Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from reportlab.pdfgen import canvas as canx
>>> c = canx.Canvas('transparent.pdf')
>>> c.setStrokeColor((1, 0, 0))
>>> transparentwhite = canx.Color(255, 255, 255, alpha = 0.0)
>>> c.setFillColor(transparentwhite)
>>> t = c.beginText()
>>> t.setTextRenderMode(2)
>>> c._code.append(t.getCode())
>>> c.setFont('Helvetica', 48)
>>> c.saveState()
>>> c.translate(100, 100)
>>> c.rotate(45)
>>> c.drawCentredString(500, 100, 'DRAFT')
>>> c.save()
>>>
>>> # merge again
...
>>> transparentbannerfile = open('transparent.pdf', 'rb')
>>> testpagefile = open('newfile.pdf', 'rb')
>>> outputfile = open('mergedtransparent.pdf', 'wb')
>>> import PyPDF2 as pdf
>>> readerbanner = pdf.PdfFileReader(transparentbannerfile)
>>> readertestpages = pdf.PdfFileReader(testpagefile)
>>> bannerpage = readerbanner.getPage(0)
>>> writeroutput = pdf.PdfFileWriter()
>>> for x in xrange(readertestpages.getNumPages()):
...     pagex = readertestpages.getPage(x)
...     pagex.mergePage(bannerpage)
...     writeroutput.addPage(pagex)
...
>>> writeroutput.write(outputfile)
>>> outputfile.close()
>>> transparentbannerfile.close()
>>> testpagefile.close()
>>>


Not beautiful (I would make the banner font edge thinner), but it is indeed transparent.

 

The transparency comes from the alpha value in the definition of the color transparentwhite.  There is sample code showing how to do this on the reportlab.com site.
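
Worth noting: reportlab expresses color components as floats from 0 to 1, so a half-transparent white, for example, would look like this (just a sketch of the constructor, not code from the report job):

from reportlab.lib.colors import Color

halftransparentwhite = Color(1, 1, 1, alpha=0.5)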

The last thing I needed to deal with was bookmarks.  I had some problems initially in that, although the bookmark showed up, it pointed at the bottom of the page, underneath the SSRS tables and charts I was trying to reference.  I got around this by digging into the dictionary structure of the PyPDF2 bookmark object.  Here is the (essentially one-line) function code:

import PyPDF2 as pdf

OBJECTKEY = '/A'
LISTKEY = '/D'


FULLPAGE = '/Fit'

def fixbookmark(bookmark):
    """
    bookmark is a PyPDF2 bookmark object.

    Side effect function that changes bookmark
    page display mode to full page.
    """
    # getObject yields a dictionary
    bookmark.getObject()[OBJECTKEY][LISTKEY][1] = pdf.generic.NameObject(FULLPAGE)
    return 0


addBookmark is a method of the PyPDF2.PdfFileWriter object.  It takes a string title, a page index (zero-based), and an optional parent PyPDF2 bookmark object.  The changes my fixbookmark function makes "take" as long as it is called prior to writing the PDF to disk with the write method of the PyPDF2.PdfFileWriter object.
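
Pulling it together, here is a sketch of the pattern I mean (the writer's pages, the bookmark title, the page index, and the output file name are all illustrative):

import PyPDF2 as pdf

writer = pdf.PdfFileWriter()
# ... pages get added to writer as in the sessions above ...
bookmark = writer.addBookmark('Production Summary', 0)  # title string, zero-based page index
fixbookmark(bookmark)                                   # force full-page display mode before writing
outputfile = open('bookmarked.pdf', 'wb')
writer.write(outputfile)
outputfile.close()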

Mike Driscoll blogged about PyPDF2 a couple of years back.  He's got a whole series on PDFs, in fact (aside: the man is a pragmatic programming blogging machine).  There are good code snippets and pretty good comment threads on those posts for newbs like me.  I found the rl118.pdf doc useful for familiarizing myself with the PDF file format and the constants used to reference objects within it.

This was a bit of an experience dump on my part.  If you've read this far, thank you for your patience and for having a look.



Tuesday, March 18, 2014

(Windows) LogParser - Install Without Admin Rights

A Twitter acquaintance @zippy1981 recommended the Windows software LogParser as a replacement for MSSQL bcp for my data transfer needs.  I downloaded the msi file from Microsoft and tried to install it.  As is true with a lot of software at work, I got a message saying the software can't be installed without admin rights.

I tweeted @zippy1981 (actually Justin Dearing in "real life") back saying I couldn't install.  He suggested using 7zip to decompress the msi file.  I downloaded 7zip portable and followed the instructions and ended up with files with names like these:

LogParser_dll.B1735C0B_1CB5_4257_8281_92109AE41CE6

The names won't work as-is for the executable, but they are easy enough to decipher - the dot before the original extension has been replaced with an underscore, and a period followed by a long string of characters has been tacked onto the end of the filename.

Here is the mini, somewhat clunky script I wrote for "fixing" the filenames (I used Python 3.3):


"""
Remove extensions from extracted
msi files.
"""


DIRX = 'C://UserPrograms//LogParserWorking//'

import os
import shutil

filenames = []

x = os.walk(DIRX)
# generator
for y in x:
    # lists of files
    for filex in y[2]:
        filenames.append(filex)


for filex in filenames:
    # rip off end
    # change _ to .
    print(filex)
    # reverse
    filey = filex[-1::-1]
    # strip
    dotx = filey.find('.')
    filey = filey[dotx + 1:]
    # replace underscore
    underscore = filey.find('_')
    firstpart = filey[:underscore]
    firstpart += '.'
    secondpart = filey[underscore + 1:]
    filey = firstpart + secondpart
    filey = filey[-1::-1]
    print(filey)
    shutil.move(DIRX + filex, DIRX + filey)
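
(In hindsight, os.path.splitext and str.rpartition would express the renaming idea a little more directly; a sketch, assuming the same naming pattern as above:)

import os

def realname(msiname):
    # 'LogParser_dll.B1735C0B_...' -> 'LogParser.dll'
    root, _guid = os.path.splitext(msiname)
    base, _, ext = root.rpartition('_')
    return base + '.' + ext if base else root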


And voilà - I've got LogParser without having to bother our IT people for an install.

I'm probably late to the party on this msi extraction concept.  Still, I thought there might be other people who are as unaware of it as I was, so I'm blogging it.  Thanks for having a look.

Saturday, February 9, 2013

Generators Used as Counters

I'm a bit late to the Python generator party.  My main barrier to adoption was not understanding where I could apply the generator function concept.  I had always wanted to apply this tool to counters in a counting-based problem, and I recently had an opportunity to do so.

Confession:  every time I do a counter based problem, I end up with off by one errors, so much so that I have to go through at least three iterations to get it right.

Generator functions used in part as counters are what got me through this.  The script was a mess and the use of the yield statement inside a couple functions cleaned it up for me to the point where it only took one more iteration to get it right (I had spent a weekend and about six iterations prior to this).

The main utility of the yield statement for me has three parts:

    1) it "remembers" where it is and where it stopped within the generator between next() calls.

    2) it "remembers" the values within itself (in this case a counter) between next() calls.

    3) because of 1) and 2) it's harder (for me, at least) to corrupt values and state than it is by passing around a counter variable and applying += operator calls.
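
A minimal sketch of the kind of counter I mean (the newnum function in the script below is essentially this):

def newnum():
    """Yield 1, 2, 3, ... remembering where it left off between next() calls."""
    x = 1
    while 1:
        yield x
        x += 1

counter = newnum()
next(counter)   # 1
next(counter)   # 2 - no manual += bookkeeping to get wrong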

The (somewhat convoluted) scenario:

I'm working at a mine in Africa.  The boss comes to me with a spreadsheet of about 500 items from various data sources and requests a database and a report that looks just like it and updates any time it's called.  Difficult, but not impossible.  I start on a schema for a database.  Considering I'm a geologist, it's not all bad:


 









A coworker comes by not long after and informs us that he is not keen on the idea of a custom database and that the problem should be solved using SharePoint lists as lookup tables.  MS SQL 2008 R2 SSRS (SQL Server Reporting Services) is what we and our mine software vendor use for reporting.

I start to work on the SharePoint lists and hit a snag.  SSRS in its current state can only operate on a single dataset when you're presenting information in a table.  My main dataset in this case is the list of performance indicators or assays my boss asked for in the first place.  That list is in a SharePoint list.  Further, SSRS can only take one SharePoint list as a data source.  In other words, to the best of my knowledge, you cannot write a relational query across SharePoint lists in SSRS 2008 R2.  My solution was to look up values within my table in SSRS based on names or keys.  It got ugly and string concatenation was involved, but, amazingly, it worked.

The only real problem, apart from lousy performance, was the matching of shipment data from the mine based on mine levels as required in the spreadsheet.  What I needed to do was create a separate record in the main KPI list for each KPI-pit-material-bench-subbench combination (material is, roughly, the economic character of the rock).  There was a last column dealing with mining method, but for the time being I could get that from the subbench number.

Anyway, 4 or 5 data sources requiring elevations, 4 (currently, there will be more) open pits, 5 KPI's, 9 or 10 materials, about 40 benches, 9 subbenches - it ends up being a lot of records (5 * 4 * 5 * 10 * 40 * 9 ~= 360,000).  In the end I had about 350,000 records total.  I actually started doing this by hand in an Excel spreadsheet.  I got as far as replicating everything for the four pits and knew I would have to do the rest programmatically.  Knowing what I know now, I would have done the whole thing programmatically.  Live and learn.

Shown below is the code (sorry about the plain look - I'm not up to speed with Syntax Highlighter and other tools yet).  There are some naive constructs and hacks, but it works.  Thanks for having a look.

"""
Deal with main list of KPI's in
Jeff's reconciliation report.
Deal with elevation, flitch, and mining method here.
Assign key values (integers) to
relevant columns.
"""

import csv
import collections as colx
import os
import sys
import pprint

# ranges that require bench/flitch addition
import benchranges as br

# bench range file format:
#
# """
# Row ranges in <boss's> Excel spreadsheet
# that require repetition for bench and flitch.
# """
#
# RANGES = ((36, 42),
#           (42, 48),
#           (48, 54),
#           (54, 60),
#           (60, 66),
#           (66, 72),
#           (72, 78),
#           (78, 84),
#              .
#              .
#              .
#             etc.

DIRX = r'C:\MSPROJECTS\SiteTripJAN13\JeffDatabase\FirstAttemptWDatabase\LookupDataSetup'
DIRCSV = r'C:\MSPROJECTS\SiteTripJAN13\JeffDatabase\FirstAttemptWDatabase\LookupDataSetup\loadtosharepoint\spreadsheetsforload\cvsdumps'
CSV = '.csv'

KPIFIELDS = ['KPIID', 'KPIOrder', 'AttributeDescription', 'Units', 'DataType', 'Source', 'ValueCalc', 'Period', 'Location', 'Pit', 'Level', 'MiningMethod']

NTKPIFIELDS = ','.join(KPIFIELDS)

# account for extra level, flitch, and BHVermeer files
CSVFILES = KPIFIELDS[2:] + ['Benches', 'BHVermeer', 'Flitches']

MAINFILE = 'kpiids'

BIGNUMBER = 1000000

# OK, the only file that needs a named tuple is the KPI one.
# The others all have two columns where we're trying to match
# basic format of KPI lookup csv files (faux unit example):
#
# 1,Frobnostication unit
# 2,Foo unit
# 3,ounces
# 4,gallons per minute
#     .
#     .
#     .
#    etc.

# format of main input file:
#
# 1,1,Plant Harvest Salable Copper,Tonne Metal,Plant Actual,Mill Report,Value,Shift Day,Plant,Not Applicable,Not Applicable,Not Applicable
# 2,2,Plant Contained Salable Cobalt,Tonne Metal,Plant Actual,Mill Report,Value,Shift Day,Plant,Not Applicable,Not Applicable,Not Applicable
# 3,3,Mill Tonnage,Tonne Ore,Plant Actual,Mill Report,Value,Hour,Plant,Not Applicable,Not Applicable,Not Applicable
#     .
#     .
#     .
#    etc.

# intended format of output file:
#
# 1,1,120,10,4,2,2,3,3,3,2,42,10,2,4
# 2,2,119,10,4,2,2,3,3,3,2,42,10,2,4
# 3,3,108,11,4,2,2,1,3,3,2,42,10,2,4
#     .
#     .
#     .
#    etc.

# have to add slots for pit, bench, flitch, mining method
# 14 slots all integers
# XXX - very much hard coded
# 'KPIID', 'KPIOrder', 'AttributeDescription', 'Units'
# 'DataType', 'Source', 'ValueCalc', 'Period'
# 'Location', 'Pit', 'Level', Bench, Flitch'
# 'MiningMethod', BHVermeer

# builds '{KPIID},{KPIOrder},...,{Level},' from the first 11 field names,
# then tacks on the four added columns
MAINFILELINEFMT = (11 * '{{{}}},').format(*KPIFIELDS[:11])
MAINFILELINEFMT += '{Bench},{Flitch},{MiningMethod},{BHVermeer}'

KPIID = 'KPIID'
KPIORDER = 'KPIOrder'
PIT = 'Pit'
BENCH = 'Bench'
FLITCH = 'Flitch'
BHVERMEER = 'BHVermeer'
NA = 'Not Applicable'

# specific for each item (1FEB2013)
NABENCH = 42
NAFLITCH = 10
NABHVERMEER = 4
NAPIT = 3
EXCEL = 'excel'

def makelookupdict():
    """
    Read the available csv files for the SharePoint
    lookup tables and make a corresponding dictionary
    of integer values corresponding to each value
    (string) key.

    Returns dictionary.
    """
    os.chdir(DIRCSV)
    # make lookup table of values
    lookupdict = {}
    for filex in CSVFILES:
        f = open(filex + CSV, 'r')
        lookupdict[filex] = {}
        readerx = csv.reader(f, dialect = EXCEL)
        for linex in readerx:
            keyx = linex[1]
            lookupdict[filex][keyx] = linex[0]
        f.close()
    return lookupdict

def newnum():
    """
    Generator function that will continuously
    increment an integer and yield the new
    value.
    """
    x = 1
    while 1:
        yield x
        x += 1

def initializerecord(recorddict, recordnt, lookupdict):
    """
    Put corresponding integer lookup values
    for fields in dictionary.

    recorddict is an empty dictionary.

    recordnt is a named tuple of record values.

    lookupdict is a dictionary of lookup indices
    for the values in question.
    """
    for namex in KPIFIELDS[2:]:
        # XXX - dictionary lookup is a bit hackish
        recorddict[namex] = lookupdict[namex][recordnt.__dict__[namex]]

def dummyfieldsforkpi(recorddict, n):
    """
    Fills in Not Applicable values for fields the KPI
    doesn't require (bench, flitch, etc.).

    recorddict is the dictionary for the KPI
    in question.

    n is the KPI identifier and order number
    of original KPI file.
    """
    recorddict[BENCH] = NABENCH
    recorddict[FLITCH] = NAFLITCH
    recorddict[BHVERMEER] = NABHVERMEER
    recorddict[KPIID] = n
    recorddict[KPIORDER] = n

def dealwithbenchloop(n, counter, recorddicts, f2):
    """
    Generator function.

    Fill in rows for benches and flitches for a given
    set of grouped KPI's.

    n is the current line number of the file being
    written to.

    counter is the generator for incrementing line
    numbers in the file being written to.

    recorddicts is the base list of grouped record
    dictionaries.

    f2 is the file being written to.

    Yields the current position of the line number counter
    for the new file.
    """
    # loop through benches
    for benchx in range(1, 42):
        # loop through flitches, increment counter
        # 1-9 corresponds to flitch indices in original flitch list (1-9)
        for flitchx in range(1, 10):
            for recn in recorddicts:
                # newlineno will represent KPIID, KPIOrder, add to dict
                recn[KPIID] = n
                recn[KPIORDER] = n
                # match bench, flitch, bhvermeer with id's and add to dict
                recn[BENCH] = benchx
                recn[FLITCH] = flitchx
                if flitchx == 9:
                    recn[BHVERMEER] = 1
                else:
                    recn[BHVERMEER] = 2
                # write dictionary to file
                print(MAINFILELINEFMT.format(**recn), file = f2)
                n = next(counter)
                yield n

def cyclethroughkpis():
    """
    Generator function that will cycle through
    a list of KPI (key performance indicator)
    records and write each one to a new file
    adding records for benches and flitches
    where appropriate.
    """
    # get main file and assign lookup values
    os.chdir(DIRX)
    f = open(MAINFILE + CSV, 'r')
    f2 = open('kpilookups.csv', 'w')
    # named tuple for cycling through file
    recordx = colx.namedtuple('recordx', KPIFIELDS)
    # counter for new file
    x = newnum()
    # map object of records
    # want line number and named tuple records in an iterable
    ziplines = zip(range(1, BIGNUMBER), map(recordx._make, csv.reader(f, dialect = EXCEL)))
    # lookup dictionary
    lookupdict = makelookupdict()
    # multiple bench original file line index ranges
    benchidxs = (range(*z) for z in br.RANGES)
    # increment new file counter
    n = next(x)
    while 1:
        recorddict = {}
        try:
            zipnt = next(ziplines)
        except StopIteration:
            f.close()
            f2.close()
            # input exhausted - end the generator cleanly
            return
        initializerecord(recorddict, zipnt[1], lookupdict)
        # if this requires multiple repeated records for bench and flitch
        if int(recorddict[PIT]) != NAPIT:
            # list of grouped record dictionaries
            recorddicts = [recorddict]
            # get range of rows for grouped KPI's
            rowrange = next(benchidxs)
            # first item in rowrange already current
            for i in rowrange[1:]:
                zipnt = next(ziplines)
                recorddict = {}
                initializerecord(recorddict, zipnt[1], lookupdict)
                recorddicts.append(recorddict)
            r = dealwithbenchloop(n, x, recorddicts, f2)
            # track new file counter
            for w in r:
                yield w
            # track counter
            n = w
            yield n, zipnt[0], rowrange
        else:
            # make record complete
            dummyfieldsforkpi(recorddict, n)
            # write to file
            print(MAINFILELINEFMT.format(**recorddict), file = f2)
            # yield current file counter, original file counter, and row range
            yield n, zipnt[0], None
            n = next(x)

if __name__ == '__main__':
    dorecords = cyclethroughkpis()
    while 1:
        try:
            print(next(dorecords))
        except StopIteration:
            print('Finished run.')
            # XXX - a bit hackish
            sys.exit()
f.close()
f2.close()

Monday, December 24, 2012

Larry Berry (Mintec) 1962 - 2012

This is another somewhat philosophical post.  I promise I will post code and tangible how-to stuff again.

A mentor of mine, the individual who turned me on to Python, has passed on.  The whole deal was a bit sudden and really gave me pause.  What I'll try to do here is briefly summarize Larry's body of work, and put it in perspective with regard to software development and domain specific software.  Although his efforts won't see headlines in either mining circles or software ones, what he did was pretty amazing.

Larry worked for Mintec, a mine software vendor, from 1990 on. His specialty and his passion were Ore Control, the place where mine operations and engineering come together, the place, figuratively speaking, where the rubber meets the road. This part of the mining process, in my opinion, was often the bastard child of most upper management. My impression was that at the time Larry began his career at Mintec not all, but many management types had the idea that if you just planned the mine right and threw a bunch of grunts into the pit with some equipment, everything would magically happen. Tactical execution and measurement of that process are important. Larry was one of the people who recognized this immediately and did something about it. He championed until his death the idea that Ore Control and short range planning were information intensive parts of the mining process that demanded an operational database and a programming API that allowed for dynamic process configuration.

Around the year 2000, Mintec released a planning database and a Python API. Larry was very much a software implementer at the time and not a developer. Nonetheless, his input was an important part of the feedback loop within the relatively small mine software company. He told me that they had been looking at Lua as an API. It did not particularly work for him and they settled on Python instead. By comparison, I've heard that one of the other mining software companies uses Perl for its API, and another one VBA. I'm biased, but I think that for ease of learning for the customer and internal productivity, Mintec made the right choice. They started out with Python 2.1, then, along with some changes to the API, switched to Python 2.2. Today we are using Python 2.5.

As a programmer, Larry was mixed. He was a miner first, and anything that got in the way of his goal was a mere distraction. His code was clean, but seldom bulletproof. I would call him an offensive programmer, the opposite of a defensive one. He was an upbeat optimist. I was and am a somewhat morose pessimist. Much to the entertainment of those working around us, we would argue about code. The Larryism that caused me the most heartburn was the variable declared inside a conditional block (Python is a dynamic language that allows you to do that) that under a corner case didn't get executed.

Larry: Well if your data were squared away it would work!

Me: LARRY! Our data are NEVER squared away!

The crowning achievement of Larry's Mintec career, in my estimation, was not so much his code as his pushing the whole MineSight Operations package (basically the Ore Control software I described above) through, both internally at Mintec and within the mining industry. I saw things from the client side, where Larry was the vendor and I represented the mining company. I can attest to the blood, sweat, and tears involved. Spending weeks away from your family doing an implementation in a decidedly unergonomic, dusty mine office and debugging software over the phone on Christmas Day are things you can't pay someone enough to do. Larry was a true believer in his product and did what was necessary to make it happen.

Ironically, Larry was promoted to Project Manager for MineSight Operations about eight months before his death. He had arrived after all that hard work. I would tease him about spending his time in a corner office with the head of the company smoking cigars and sipping scotch. He laughed. The guy never stopped working. He was very much a type A personality. The corporate life would have been easier on him and his family. I don't know how well he would have liked it long term. He loved to be in the thick of things, fixing things, making things work beautifully.

The main lesson I can glean from Larry's life is that if you want to do something significant, liking what you do is not enough. You have to care intensely and passionately about what you're doing. He was one hell of an individual.

Thursday, December 13, 2012

Plan for Pycon 2013

I looked at the Pycon talks and events schedule and was a bit overwhelmed.  There's just too much awesome, which is, for me, a bad thing.  Basically I've got to narrow down a strategy and a scope for attending this thing or I'll just learn nothing about everything and come back to work in Arizona and Africa totally burned out from lack of sleep.

What I've tried to do is group things in categories that are relevant to what I do for a living or what interests me to the point I would make the time to actually work on them.  I then listed the talks in these categories (mine, not the conference organizers') on paper and ranked them within each category.
The pattern that fell out for me is that I would be best served by focusing on Python 3.3 and how to use it best.  For me, that's going to be the theme of the conference, even if there are other more prominent official themes.

Disclaimer:  I'm a geologist/wannabe mining engineer/sometimes programmer/sometimes developer who is trying to write decent code and maintainable software.  Among the Pycon crowd, I'm probably well into the lower 50th or 25th percentile in terms of computer literacy and programming skills.  Among mining professionals I'm probably somewhere in the upper 25th percentile in terms of those skills.  It's all relative.  What's good for me is probably not the same as what's good for you.  Hopefully this gives other people some idea of how they would like to plan their conference experience.

A) Core Language for the User

1) How to Except When You're Excepting - Nam
2) Become a logging expert in 30 minutes - G. Roy
3) Iteration & Generators: the Python Way - Ramalho
4) Loop like a native: while, for, iterators, generators - Batchelder
5) Python's Class Development Toolkit - Hettinger
6) Transforming Code into Beautiful, Idiomatic Python - Hettinger
7) Encapsulation with descriptors - Ramalho

This is all basic language stuff.  I need the repetition and reinforcement.  "Yeah, but Carl, TWO talks on iteration?!  Don't you think you're taking this whole repetition/looping thing too far?"  No.  When you do summing of tonnes and grade and weighted averages and binning of material types and truck counts and processing of millions of individual 3D geologic model blocks for a living, there are never enough ways or efficient enough programming language idioms.  iter(), like, um, forever . . .

B) Python Internals

1) The Guts of Unicode in Python - Peterson
2) All-Singing All-Dancing Python Bytecode - Hastings
3) How Import Works - Canon
4) sys._current_frames(): Take real-time x-rays of your software for fun and performance - Rochael

I'm probably going to actually understand about 25% of what I hear IF I actually do some research ahead of time.  Still, knowing how something actually works can be a huge advantage.

C) Miscellaneous Stuff

1) Why you should use Python 3 for text processing - Mertz
2) How the Internet works - McKellar
3) Internationalization and Localization Done Right - Varshney
4) Python for Robotics and Hardware Control - Foote
5) Realtime Tracking and Mapping of Geographic Objects using Python - Burhum
6) Location, Location, Location - Grace
7) The Magic of Metaprogramming - Rush
8) Integrating Jython with Java - Baker
9) Getting Python and Django Through Your Java Shop Front Door (with Jython) - Wierzbicki

All stuff that I've had to deal with in the workplace.  All stuff that I wished I had been smarter about :-\

D) Community

1) Scaling community diversity outreach - Laroia
2) Designers + Developers: Collaborating on your Python project - Elman
3) How (Not) To Build An OSS Community - Lindsley

People hacking is not my strong suit.  Anna Ravenscroft gave a talk at Pycon back in 2010 on diversity.  I had a couple "Huh, I never thought of that.  Yeah," moments during that talk.  If I can glean a handful more of those "Huh" moments from these talks, they are more than worth it.

E) Tutorials

1) A Gentle Introduction to Computer Vision - Scott
2) Python 3 Metaprogramming - Beazley

We use a fair bit of computer vision in mining and that will probably increase as we move into the future.  The main application I can think of is estimation of rock fragmentation size with SPLIT imagery.  I actually own a VB book that covers a lot of basic concepts in a programming context (Visual Basic Graphics Programming - Stephens).  The problem is that it's 1) in a dead language - VB6 and 2) not in Python.

Beazley's course is going to be sick.  I will be truly disappointed if my head doesn't explode along with everyone else's and there's not a mini zombie apocalypse inside the classroom ;-)

F) PyPgDay 2013 - San Francisco PostgreSQL User Group

I've been to these PostgreSQL days the day before OSCON a time or two.  They are generally fantastic.  I really like the PostgreSQL community and the rigor they put into making robust database software.  Berkus or Deckelman, or both should be there.  Good, smart folks!  I am looking forward to this.

G) Keynotes

1) Hettinger
2) McKellar

Confession:  I am not into keynotes.  Basically it's first thing in the morning, you're beat from travel.  You herd into a huge dark conference room with a zillion other cattle, er, people.  If the talk, like a movie, doesn't make you forget about how you feel it can be a bit of a soul crushing letdown.  Back in 2010 Guido changed the format of his keynote to a twitter Q & A.  I hated it; some people loved it.  Yes, I am an uptight inflexible old fart who could stand to loosen up.  Yes, Guido is an indisputably great guy.  That said, I don't think I'll re-Guido myself on the keynote end for some time to come.

Raspberry Pi is a wildly cool innovation.  Sadly, I still haven't successfully hacked on the Arduino.  By the time people are mastering the great-grandchild of the Raspberry Pi, I'll finally be getting started on the Arduino in a nursing home.  Same problem as I mentioned at the outset - too much awesomeness, too much of a good thing.

Hettinger is really in the zone on the Python language end of things and cool new ways to use it.  He tweets tips constantly and has an infectious enthusiasm for the language.  I don't think he's reached his ceiling with respect to innovation or burn out.  Safe bet.

McKellar is a Linux kernel engineer, if I understood her bio correctly.  So here's this woman who has really made it in the industry and isn't wanting for anything accomplishment-wise or employment related.  Nonetheless, she is taking the time to get involved in outreach.  To me that says passion.  I'd like to hear what she has to say, whether she's talking about kernel hacking or outreach.

H) Conference Prep

1) Install or build Python 3.3 and dink with it.
2) Get a real database with real world data that mean something to me set up on PostgreSQL on my OpenBSD laptop.
3) Brush up on Unicode and Python's Unicode capabilities in preparation for Peterson's talk.

That's about it.  See you at the con.

Wednesday, November 14, 2012

The Ada Initiative - why I donated

This post has little to do with Python, but a fair bit to do with the programming community.

I donated to the Ada Initiative recently.  The option I went with was the $1/day for a year, or $365.  In the receipt, one of the officers of the organization left the note, "If you have a minute, can you let your friends know why you donated?  People usually donate because someone they trust supports the Ada Initiative."  Hopefully this post does something positive to promote the organization, which promotes women in computing.

First the disclaimers.  Do I agree with all the members of the Ada Initiative on everything including their approaches to promoting women in computing?  Have I been a model citizen and will I behave angelically and perfectly with regard to the promotion of women in computing?  Not entirely and probably not, respectively.

That said, I think the Ada Initiative has the best chance to make a difference in this realm.  I saw that Mary Gardiner is involved.  I don't know her personally, but I do know that she has done some quality work reviewing Pycon papers for the Python community.  Her online communication always impressed me as measured and intelligent.  Smart and not a jerk go a long way in my book.  Her participation in the project gives me a warm and fuzzy feeling that my money will be put to good use.  Someone with a track record of working hard and producing results is involved.  Further, I believe from tweets that Leigh Honeywell is a supporter of the organization.  I met her at Pycon a few years back and have followed some of her online activity.  Her promotion of the cause lends even more credibility in my eyes.  The woman who dealt with my donation and setting up the matching grant from my employer, Valerie Aurora, was very helpful and professional.  All in all, I have very strong confidence that these dedicated, hard-working women will effect good in the world, or at least give it a damn good effort.

Lastly, there are personal reasons.  As some of you may know, I lost my wife earlier this year.  My wife was a Unix shell scripter back in the early 90's.  She put up with a lot of nonsense at work.  An incident that sticks in my head is a Christmas party where a coworker came up to her and arrogantly and dismissively said, "Nice t*ts."  That sort of thing just isn't right and did contribute to her leaving the field.  It's hard to explain, but when you are faced with mortality issues, you end up rehashing a lot of things in your head.  In the end you're generally a lot less timid about doing things you always wanted to or thought you should do.

Well,  that's the story.

Sunday, July 3, 2011

pyeuclid, vector math, and polygon offset

I had previously done a series of posts on polygon offset intended as a practical guide for accomplishing the task quickly.  Some kind folks pointed out that I was making things considerably harder than necessary by using trigonometric functions when vector math would be easier and less error prone.

A coworker lent me his Java code that does polygon offset.  I translated it into Python (2.5) using the pyeuclid module:

import euclid as eu
import copy

OFFSET = 0.15

# coordinates
             # PT 1
MONASTERY = [(1.1, 0.75),
             # PT 2
             (1.2, 1.95),
             . . .
             # PT 21
             (1.1, 0.75)]

def scaleadd(origin, offset, vectorx):
    """
    From a vector representing the origin,
    a scalar offset, and a vector, returns
    a Vector3 object representing a point
    offset from the origin.

    (Multiply vectorx by offset and add to origin.)
    """
    multx = vectorx * offset
    return multx + origin

def getinsetpoint(pt1, pt2, pt3):
    """
    Given three points that form a corner (pt1, pt2, pt3),
    returns a point offset distance OFFSET to the right
    of the path formed by pt1-pt2-pt3.

    pt1, pt2, and pt3 are 2-tuples.

    Returns a Vector3 object.
    """
    origin = eu.Vector3(pt2[0], pt2[1], 0.0)
    v1 = eu.Vector3(pt1[0] - pt2[0],
                    pt1[1] - pt2[1], 0.0)
    v1.normalize()
    v2 = eu.Vector3(pt3[0] - pt2[0],
                    pt3[1] - pt2[1], 0.0)
    v2.normalize()
    v3 = copy.copy(v1)
    v1 = v1.cross(v2)
    v3 += v2
    if v1.z < 0.0:
        retval = scaleadd(origin, -OFFSET, v3)
    else:
        retval = scaleadd(origin, OFFSET, v3)
    return retval

polyinset = []
lenpolygon = len(MONASTERY)
i = 0
poly = MONASTERY
while i < lenpolygon - 2:
    polyinset.append(getinsetpoint(poly[i],
                     poly[i + 1], poly[i + 2]))
    i += 1
polyinset.append(getinsetpoint(poly[-2],
                 poly[0], poly[1]))
polyinset.append(getinsetpoint(poly[0],
                 poly[1], poly[2]))

The result: