A couple of months back I had a one-off assignment to dump some data from a vendor-provided relational database to a csv file and from there into Excel (essentially a fairly simple ETL - extract, transform, load - exercise). It turned out a little trickier than I had planned. Disclaimer: this may not be the best approach, but it worked . . . at least twice . . . on two different computers, and that was sufficient.
Background:
Database: the relational database provided by the vendor is the back end to a graphic mine planning application. It does a good job of storing geologic and mine planning data, but requires a little work to extract the data via SQL queries.
Weighted Averages: specifically, the queries are required to do tonne-weighted averages and binning. Two areas I've worked in, mine planning and mineral processing (mineral processing could be considered a subset of metallurgy or chemical engineering), require a lot of work with weighted averages. Many of the database programming examples online deal with retail and focus on sales in the form of sum of sales by location. A weighted average by tonnes or gallons of flow requires a bit more SQL code.
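The recurring pattern is SUM(tonnes * grade) / SUM(tonnes) per group, rather than a plain SUM. A minimal Python sketch of the same idea, with made-up numbers and grouping by drift the way the queries further down do:

```python
from collections import defaultdict

# (drift, tonnes, grade) rows; numbers are made up for illustration.
rows = [('East', 1000.0, 2.0),
        ('East', 3000.0, 4.0),
        ('West', 500.0, 1.0)]

# drift -> [sum of tonnes, sum of tonnes * grade]
sums = defaultdict(lambda: [0.0, 0.0])
for drift, tonnes, grade in rows:
    sums[drift][0] += tonnes
    sums[drift][1] += tonnes * grade

# Tonne-weighted average grade per drift.
weighted = {d: tg / t for d, (t, tg) in sums.items()}
print(weighted)  # {'East': 3.5, 'West': 1.0}
```

Note that East's weighted average (3.5) is pulled toward the bigger 3000-tonne cut; a naive mean of the two grades would give 3.0.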
Breaking Up the SQL and the CSV Dump Problem: to break the weighted average and any associated binning into smaller, manageable chunks of functionality, I used MSSQL (Microsoft SQL Server) global temporary tables in my queries. Having my final result set in one of these global temporary tables allowed me to dump it to a csv file using the MSSQL bcp utility. There are other ways to get a result set and produce a csv file from it with Python, but I wanted to isolate as much functionality within the MSSQL database as possible. Also, the bcp utility gives some feedback when it fails - this made debugging and troubleshooting the one-off script easier, for me at least.
As far as the SQL goes, I may have been able to do this with a single query without too much trouble. There are tools within Transact-SQL for pivoting data and doing the sort of things I naively and crudely do with temporary tables. That said, in real life, the data are seldom this simple and this clean. There are far more permutations and exceptions. The real life version of this problem has fourteen temporary tables versus the four shown here.
Sanitized Mock Up Scenario: there's no need to go into depth on our vendor's database schema or the specific technical problem - both are a tad complicated. I like doing tonne-weighted averages with code, but it's not everyone's cup of tea. In the interest of simplifying this whole thing and making it more fun, I've based it on the old Star Trek episode The Devil in the Dark, about an underground mine on a distant planet.
Mock Data: we're modeling mined out areas and associated tonnages of rock bearing pergium, gold, and platinum in economic concentrations. (I don't know what pergium is, but it was worth enough that going to war with Mother Horta seemed like a good idea). Here is some code to create the tables and fill in the data (highly simplified schema - each mined out area is a "cut").
SQL Server 2008 R2 (Express) - table creation and mock data SQL code. I'm not showing the autogenerated db creation code - it's lengthy - suffice it to say the database name is JanusVIPergiumMine. Also, there are no keys in the tables for the sake of simplicity.
USE JanusVIPergiumMine;
CREATE TABLE cuts (
cutid INT,
cutname VARCHAR(50),
monthx VARCHAR(30),
yearx INT);
CREATE TABLE cutattributes (
cutid INT,
attributex VARCHAR(50),
valuex VARCHAR(50));
CREATE TABLE tonnes(
cutid INT NULL,
tonnes FLOAT);
CREATE TABLE dbo.gradesx(
cutid int NULL,
gradename varchar(50) NULL,
gradex float NULL);
DELETE FROM cuts;
INSERT INTO cuts
VALUES (1, 'HappyPergium1', 'April', 2015),
(2, 'HappyPergium12', 'April', 2015),
(3, 'VaultofTomorrow1', 'April', 2015),
(4, 'VaultofTomorrow2', 'April', 2015),
(5, 'Children1', 'April', 2015),
(6, 'Children2', 'April', 2015),
(7, 'VandenbergsFind1', 'April', 2015),
(8, 'VandenbergsFind2', 'April', 2015);
DELETE FROM cutattributes;
INSERT INTO cutattributes
VALUES (1, 'Drift', 'Level23East'),
(2, 'Drift', 'Level23East'),
(3, 'Drift', 'Level23West'),
(4, 'Drift', 'Level23West'),
(5, 'Drift', 'BabyHortasCutEast'),
(6, 'Drift', 'BabyHortasCutEast'),
(7, 'Drift', 'BabyHortasCutWest'),
(8, 'Drift', 'BabyHortasCutWest');
DELETE FROM tonnes;
INSERT INTO tonnes
VALUES (1, 28437.0),
(2, 13296.0),
(3, 13222.0),
(4, 6473.0),
(5, 6744.0),
(6, 8729.0),
(7, 10030.0),
(8, 2345.0);
DELETE FROM gradesx;
INSERT INTO gradesx
VALUES (1, 'Au g/tonne', 6.44),
(1, 'Pt g/tonne', 0.54),
(1, 'Pergium g/tonne', 15.23),
(2, 'Au g/tonne', 7.83),
(2, 'Pt g/tonne', 0.77),
(2, 'Pergium g/tonne', 4.22),
(3, 'Au g/tonne', 0.44),
(3, 'Pt g/tonne', 3.54),
(3, 'Pergium g/tonne', 2.72),
(4, 'Au g/tonne', 0.87),
(4, 'Pt g/tonne', 2.87),
(4, 'Pergium g/tonne', 1.11),
(5, 'Au g/tonne', 12.03),
(5, 'Pt g/tonne', 0.33),
(5, 'Pergium g/tonne', 10.01),
(6, 'Au g/tonne', 8.72),
(6, 'Pt g/tonne', 1.38),
(6, 'Pergium g/tonne', 5.44),
(7, 'Au g/tonne', 7.37),
(7, 'Pt g/tonne', 1.59),
(7, 'Pergium g/tonne', 4.05),
(8, 'Au g/tonne', 3.33),
(8, 'Pt g/tonne', 0.98),
(8, 'Pergium g/tonne', 3.99);
Python Code to Run the Dump/ETL to CSV: this is essentially a series of calls to MSSQL's sqlcmd and bcp utilities driven from Python's subprocess module. What made this particularly brittle and hairy is the manner in which the lifetime of temporary tables is determined in MSSQL. To get the temporary table with my results to persist, I had to wrap its creation inside a single long-lived process. I'm ignorant of the internal workings of buffers and memory here, but the MSSQL sqlcmd commands do not execute or write to disk exactly when you might expect them to. Nothing is really completed until the process hosting sqlcmd is killed.
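Stripped of the sqlcmd specifics, the pattern is: start one long-lived process, feed every batch down its stdin, and only close the pipe at the very end, so the ##global temp tables survive the whole sequence. Here is a runnable sketch of that shape, with a Python child process standing in for sqlcmd so it works without SQL Server installed; with sqlcmd the mechanics are the same:

```python
import subprocess
import sys

# Stand-in child that reads "batches" from stdin the way sqlcmd does.
# (Using Python as the child keeps this sketch runnable anywhere.)
child = ("import sys\n"
         "for line in sys.stdin:\n"
         "    print('batch: ' + line.strip())\n")
proc = subprocess.Popen([sys.executable, '-c', child],
                        stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE)
# Many writes down the same pipe - in the real script these are the
# CREATE TABLE ##..., INSERT, and bcp batches.
proc.stdin.write(b'CREATE TABLE ##T (x INT);\nGO\n')
proc.stdin.write(b'INSERT INTO ##T VALUES (1);\nGO\n')
proc.stdin.write(b'exit\n')
proc.stdin.close()   # only now is the "session" over
output = proc.stdout.read().decode()
proc.wait()
print(output)
```

The point is that nothing about the session is final until stdin closes and the child exits - which is exactly why the temp tables in the real script live long enough for bcp to read them.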
At work I actually got the bcp format file generated on the fly - I wasn't able to reproduce this behavior for this mock exercise. Instead, I generated a bcp format file for the target table dump "by hand" and put the file in my working directory.
As I show further on, this SQL data dump will be run from a button within an Excel spreadsheet.
Mr. Spock, or better said, Horta Mother says it best:
Subprocesses, sqlcmd, bcp, Excel . . .
PAAAAAIIIIIIIN!
#!C:\Python34\python
# blogsqlcmdpull.py
# XXX
# Changed my laptop's name to MYLAPTOP.
# Yours will be whatever your computer
# name is.
import os
import subprocess as subx
import shlex
import time
import argparse
# Need to make sure you are in proper Windows directory.
# Can vary from machine to machine based on
# environment variables.
# Googled StackOverflow.
# 5137497/find-current-directory-and-files-directory
EXCELDIR = os.path.dirname(os.path.realpath(__file__))
os.chdir(EXCELDIR)
print('\nCurrent directory is {:s}'.format(os.getcwd()))
parser = argparse.ArgumentParser()
# 7 digit argument like 'Apr2015'
# Feed in at command line
parser.add_argument('monthyear',
help='seven digit, month abbreviation (Apr2015)',
type=str)
args = parser.parse_args()
MONTHYEAR = args.monthyear
# Use Peoplesoft/company id so that more than
# one user can run this at once if necessary
# (note: will not work if one user tries to
# run multiple instances at the same
# time - theoretically <not tested>
# tables will get mangled and data
# will be corrupt.)
USER = os.getlogin()
CSVDUMPNAME = 'csvdumpname'
CSVDUMP = 'nohandjamovnumbersbcp'
CSVEXT = '.csv'
HOMESERVERNAME = 'homeservername'
LOCALSERVER = r'MYLAPTOP\SQLEXPRESS'
USERNAME = 'username'
# Need to fill in month, year
# with input from Excel spreadsheet.
QUERYDICT = {'month':"'{:s}'",
'year':0,
USERNAME:USER}
# For sqlcmd and bcp
ERRORFILENAME = 'errorfilename'
STDOUTFILENAME = 'stdoutfilename'
ERRX = 'sqlcmderroutput.txt'
STDOUTX = 'sqcmdoutput.txt'
EXIT = '\nexit\n'
UTF8 = 'utf-8'
GOX = '\nGO\n'
# 2 second pause.
PAUSEX = 2
SLEEPING = '\nsleeping {pause:d} seconds . . .\n'
# XXX - Had to generate this bcp format file
# from table in MSSQL Management Studio -
# dos command line:
# bcp ##TARGETX format nul -f test.fmt -S MYLAPTOP\SQLEXPRESS -t , -c -T
# XXX - you can programmatically extract
# column names from the bcp format
# file or
# you can dump them from SQLServer
# with a separate query in bcp -
# I have done neither here
# (I hardcoded them).
FMTFILE = 'formatfile'
COLBCPFMTFILE = 'bcp.fmt'
CMDLINEDICT = {HOMESERVERNAME:LOCALSERVER,
'exit':EXIT,
CSVDUMPNAME:CSVDUMP,
ERRORFILENAME:ERRX,
STDOUTFILENAME:STDOUTX,
'go':GOX,
USERNAME:USER,
'pause':PAUSEX,
FMTFILE:COLBCPFMTFILE}
# Startup for sqlcmd interactive mode.
SQLPATH = r'C:\Program Files\Microsoft SQL Server'
SQLPATH += r'\100\Tools\Binn\SQLCMD.exe'
SQLCMDEXE = [SQLPATH]
SQLCMDARGS = shlex.split(
    '-S{homeservername:s}'.format(**CMDLINEDICT),
    posix=False)
SQLCMDEXE.extend(SQLCMDARGS)
BCPSTR = ':!!bcp "SELECT * FROM ##TARGETX{username:s};" '
BCPSTR += 'queryout {csvdumpname:s}.csv -t , '
BCPSTR += '-f {formatfile:s} -S {homeservername:s} -T'
BCPSTR = BCPSTR.format(**CMDLINEDICT)
def cleanslate():
    """
    Delete files from previous runs.
    """
    # XXX - only one file right now.
    files = [CSVDUMP + CSVEXT]
    for filex in files:
        if os.path.exists(filex) and os.path.isfile(filex):
            os.remove(filex)
    return 0
MONTHS = {'Jan':'January',
'Feb':'February',
'Mar':'March',
'Apr':'April',
'May':'May',
'Jun':'June',
'Jul':'July',
'Aug':'August',
'Sep':'September',
'Oct':'October',
'Nov':'November',
'Dec':'December'}
def parseworkbookname():
    """
    Get month (string) and year (integer)
    from name of workbook (Apr2015).
    Return as month, year 2 tuple.
    """
    # XXX
    # Write this out - will eventually
    # need error checking/try-catch
    monthx = MONTHS[MONTHYEAR[:3]]
    yearx = int(MONTHYEAR[3:])
    return monthx, yearx
# Global Temporary Tables
TONNESTEMPTBL = """
CREATE TABLE ##TONNES{username:s} (
yearx INT,
monthx VARCHAR(30),
cutid INTEGER,
drift VARCHAR(30),
tonnes FLOAT);
"""
FILLTONNES = """
USE JanusVIPergiumMine;
DECLARE @DRIFT CHAR(5) = 'Drift';
INSERT INTO ##TONNES{username:s}
SELECT cutx.yearx,
cutx.monthx,
cutx.cutid,
cutattrx.valuex AS drift,
tonnesx.tonnes
FROM cuts cutx
INNER JOIN cutattributes cutattrx
ON cutx.cutid = cutattrx.cutid
INNER JOIN tonnes tonnesx
ON cutx.cutid = tonnesx.cutid
WHERE cutx.yearx = {year:d} AND
cutx.monthx = {month:s} AND
cutattrx.attributex = @DRIFT;
"""
GRADESTEMPTBL = """
CREATE TABLE ##GRADES{username:s} (
cutid INTEGER,
drift VARCHAR(30),
gradenamex VARCHAR(50),
graden FLOAT);
"""
FILLGRADES = """
USE JanusVIPergiumMine;
DECLARE @DRIFT CHAR(5) = 'Drift';
INSERT INTO ##GRADES{username:s}
SELECT cutx.cutid,
cutattrx.valuex AS drift,
gradesx.gradename,
gradesx.gradex
FROM cuts cutx
INNER JOIN cutattributes cutattrx
ON cutx.cutid = cutattrx.cutid
INNER JOIN gradesx
ON cutx.cutid = gradesx.cutid
WHERE cutx.yearx = {year:d} AND
cutx.monthx = {month:s} AND
cutattrx.attributex = @DRIFT;
"""
# Sum and tonne-weighted averages
MONTHLYPRODDATASETTEMPTBL = """
CREATE TABLE ##MONTHLYPRODDATASET{username:s} (
yearx INT,
monthx VARCHAR(30),
drift VARCHAR(30),
tonnes FLOAT,
gradename VARCHAR(50),
grade FLOAT);
"""
FILLMONTHLYPRODDATASET = """
INSERT INTO ##MONTHLYPRODDATASET{username:s}
SELECT tonnesx.yearx,
tonnesx.monthx,
tonnesx.drift,
SUM(tonnesx.tonnes) AS tonnes,
gradesx.gradenamex AS gradename,
SUM(tonnesx.tonnes * gradesx.graden)/
SUM(tonnesx.tonnes) AS graden
FROM ##TONNES{username:s} tonnesx
INNER JOIN ##GRADES{username:s} gradesx
ON tonnesx.cutid = gradesx.cutid
GROUP BY tonnesx.yearx,
tonnesx.monthx,
tonnesx.drift,
gradesx.gradenamex;
"""
# Pivot
TARGETXTEMPTBL = """
CREATE TABLE ##TARGETX{username:s} (
yearx INT,
monthx VARCHAR(30),
drift VARCHAR(30),
tonnes FLOAT,
pergium FLOAT,
Au FLOAT,
Pt FLOAT);
"""
FILLTARGETX = """
DECLARE @PERGIUM CHAR(15) = 'Pergium g/tonne';
DECLARE @GOLD CHAR(10) = 'Au g/tonne';
DECLARE @PLATINUM CHAR(10) = 'Pt g/tonne';
INSERT INTO ##TARGETX{username:s}
SELECT mpds.yearx,
mpds.monthx,
mpds.drift,
MAX(mpds.tonnes) AS tonnes,
MAX(perg.grade) AS pergium,
MAX(au.grade) AS Au,
MAX(pt.grade) AS Pt
FROM ##MONTHLYPRODDATASET{username:s} mpds
INNER JOIN ##MONTHLYPRODDATASET{username:s} perg
ON perg.drift = mpds.drift AND
perg.gradename = @PERGIUM
INNER JOIN ##MONTHLYPRODDATASET{username:s} au
ON au.drift = mpds.drift AND
au.gradename = @GOLD
INNER JOIN ##MONTHLYPRODDATASET{username:s} pt
ON pt.drift = mpds.drift AND
pt.gradename = @PLATINUM
GROUP BY mpds.yearx,
mpds.monthx,
mpds.drift
ORDER BY mpds.drift;
"""
# 1) Create global temp tables.
# 2) Fill global temp tables.
# 3) Get desired result set into the target global temp table.
# 4) Run bcp against target global temp table.
# 5) Drop global temp tables.
CREATETABLES = {1:TONNESTEMPTBL,
2:GRADESTEMPTBL,
3:MONTHLYPRODDATASETTEMPTBL,
4:TARGETXTEMPTBL}
FILLTABLES = {1:FILLTONNES,
2:FILLGRADES,
3:FILLMONTHLYPRODDATASET,
4:FILLTARGETX}
def getdataincsvformat():
    """
    Retrieve data from MSSQL server.
    Dump into csv text file.
    """
    numtables = len(CREATETABLES)
    with open('{errorfilename:s}'.format(**CMDLINEDICT), 'w') as e:
        with open('{stdoutfilename:s}'.format(**CMDLINEDICT), 'w') as f:
            sqlcmdproc = subx.Popen(SQLCMDEXE, stdin=subx.PIPE,
                                    stdout=f, stderr=e)
            for i in range(numtables):
                cmdx = (CREATETABLES[i + 1]).format(**QUERYDICT)
                print(cmdx)
                sqlcmdproc.stdin.write(bytes(cmdx +
                    '{go:s}'.format(**CMDLINEDICT), UTF8))
                print(SLEEPING.format(**CMDLINEDICT))
                time.sleep(PAUSEX)
            for i in range(numtables):
                cmdx = (FILLTABLES[i + 1]).format(**QUERYDICT)
                print(cmdx)
                sqlcmdproc.stdin.write(bytes(cmdx +
                    '{go:s}'.format(**CMDLINEDICT), UTF8))
                print(SLEEPING.format(**CMDLINEDICT))
                time.sleep(PAUSEX)
            print('bcp csv dump command (from inside sqlcmd) . . .')
            sqlcmdproc.stdin.write(bytes(BCPSTR, UTF8))
            print(SLEEPING.format(**CMDLINEDICT))
            time.sleep(PAUSEX)
            sqlcmdproc.stdin.write(bytes('{exit:s}'.format(**CMDLINEDICT), UTF8))
    return 0
monthx, yearx = parseworkbookname()
# Get rid of previous files.
print('\ndeleting files from previous runs . . .\n')
cleanslate()
# Get month and year into query dictionary.
QUERYDICT['month'] = QUERYDICT['month'].format(monthx)
QUERYDICT['year'] = yearx
getdataincsvformat()
print('done')
It's ugly, but it works.
Keeping with the Horta theme, this would be a good spot for an image break:
Damnit, Jim, I'm a geologist not a database programmer.
You're an analyst, analyze.
Load to Excel: this is fairly straightforward - COM programming with Mark Hammond and company's venerable win32com library. The only working version of win32com I had on the laptop I'm writing this blog entry on was for a Python 2.5 release that came with an old version of our mine planning software (MineSight/Hexagon) - the show must go on!
#!C:\MineSight\mpython
# blognohandjamnumberspython2.5.py
# mpython is Python 2.5 on this machine.
# Had to remove collections.namedtuple
# (used dictionary instead) and new
# string formatting (reverted to use
# of ampersand for string interpolation).
# Lastly, did not have argparse at my
# disposal.
"""
Get numbers into spreadsheet
without having to hand jam
everything.
"""
from __future__ import with_statement
import os
from win32com.client import Dispatch
# Plan on receiving Excel file's
# path from call from Excel workbook.
import sys
# Path to Excel workbook.
WB = sys.argv[1]
# Worksheet name.
WSNAME = sys.argv[2]
# Looking for data file in current directory.
# (same directory as Python script)
CSVDUMP = 'nohandjamovnumbersbcp.csv'
# XXX - repeated code from data dump file.
CURDIR = os.path.dirname(os.path.realpath(__file__))
os.chdir(CURDIR)
print('\nCurrent directory is %s' % os.getcwd())
# os.path.join handles the path concatenation.
CSVPATH = os.path.join(CURDIR, CSVDUMP)
# Fields in csv dump.
YEARX = 'yearx'
MONTHX = 'monthx'
DRIFT = 'drift'
TONNES = 'tonnes'
PERGIUM = 'pergium'
GOLD = 'Au'
PLATINUM = 'Pt'
FIELDS = [YEARX,
MONTHX,
DRIFT,
TONNES,
PERGIUM,
GOLD,
PLATINUM]
# Excel cells.
# Map this to csv dump and brute force cycle to fill in.
ROWCOL = '%s%d'
COLUMNMAP = dict((namex, colx) for namex, colx in
zip(FIELDS, ['A', 'B', 'C', 'D',
'E', 'F', 'G']))
EXCELX = 'Excel.Application'
def getcsvdata():
    """
    Puts csv data (CMP dump) into
    a list of data structures
    and returns list.
    """
    with open(CSVPATH, 'r') as f:
        records = []
        for linex in f:
            # XXX - print for debugging/information
            print([n.strip() for n in linex.split(',')])
            records.append(dict(zip(FIELDS,
                                    (n.strip() for n
                                     in linex.split(',')))))
    return records
# Put Excel stuff here.
def getworkbook(workbooks):
    """
    Get handle to desired workbook
    """
    for x in workbooks:
        print(x.FullName)
        if x.FullName == WB:
            # XXX - debug/information print statement
            print('EUREKA')
            break
    return x
def fillinspreadsheet(records):
    """
    Fill in numbers in spreadsheet.
    Side effect function.
    records is a list of dicts.
    """
    excelx = Dispatch(EXCELX)
    wb = getworkbook(excelx.Workbooks)
    ws = wb.Worksheets.Item(WSNAME)
    # Start entering data at row 4.
    row = 4
    for recordx in records:
        for x in FIELDS:
            column = COLUMNMAP[x]
            valuex = recordx[x]
            cellx = ws.Range(ROWCOL % (column, row))
            # Selection makes pasting of new value visible.
            # I like this - not everyone does. YMMV
            cellx.Select()
            cellx.Value = valuex
        # On to the next record on the next row.
        row += 1
    # Come back to origin of worksheet at end.
    ws.Range('A1').Select()
    return 0
cmprecords = getcsvdata()
fillinspreadsheet(cmprecords)
print('done')
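A note on parsing: getcsvdata splits on commas by hand, which is fine here because bcp emits plain comma-delimited fields with no quoting or embedded commas. If the dump ever gets messier, the standard library's csv module is the safer route. A Python 3 sketch against the same seven-field layout (the function name here is mine, not from the script above):

```python
import csv
import io

# Same seven fields the script above hardcodes.
FIELDS = ['yearx', 'monthx', 'drift', 'tonnes',
          'pergium', 'Au', 'Pt']

def getcsvrecords(fileobj):
    """Parse a headerless bcp dump into a list of dicts."""
    # bcp output has no header row, hence the explicit fieldnames.
    reader = csv.DictReader(fileobj, fieldnames=FIELDS,
                            skipinitialspace=True)
    return list(reader)

# Stand-in for the real csv file on disk.
sample = io.StringIO('2015,April,Level23East,41733.0,11.72,6.88,0.61\n')
records = getcsvrecords(sample)
print(records[0]['drift'])  # Level23East
```

DictReader hands back strings, which suits this workflow - as noted below, Excel does the type conversion on paste.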
On to the VBA code inside the Excel spreadsheet (macros) that execute the Python code:
Option Explicit
Const EXECX = "C:\Python34\python "
Const EXECXII = "C:\MineSight\mpython\python\2.5\python "
Const EXCELSCRIPT = "blognohandjamnumberspython2.5.py "
Const SQLSCRIPT = "blogsqlcmdpull.py "
Sub FillInNumbers()
Dim namex As String
Dim wb As Workbook
Dim ws As Worksheet
Dim longexecstr As String
Set ws = Selection.Worksheet
'Try to get current worksheet name to feed values to query.
namex = ws.Name
longexecstr = EXECXII & " " & ActiveWorkbook.Path
longexecstr = longexecstr & Chr(92) & EXCELSCRIPT
longexecstr = longexecstr & ActiveWorkbook.Path & Chr(92) & ActiveWorkbook.Name
longexecstr = longexecstr & " " & namex
VBA.Interaction.Shell longexecstr, vbNormalFocus
End Sub
Sub GetSQLData()
Dim namex As String
Dim ws As Worksheet
Set ws = Selection.Worksheet
'Try to get current worksheet name to feed values to query.
namex = ws.Name
VBA.Interaction.Shell EXECX & ActiveWorkbook.Path & _
Chr(92) & SQLSCRIPT & namex, vbNormalFocus
End Sub
I always use Option Explicit in my VBA code - that's not particularly pythonic, but being pythonic inside the VBA interpreter can be hazardous. As always, YMMV.
Lastly, a rough demo and a data check. We'll run the SQL dump from the top button on the Excel worksheet:
And now we'll run the lower button to put the data into the spreadsheet. It's probably worth noting here that I did not bother doing any type conversions on the text coming out of the SQL csv dump in my Python code. That's because Excel handles that for you. It's not free software (Excel/Office) - might as well get your money's worth.
We'll do a check on the first row for tonnes and a pergium grade. Going back to our original data:
Cuts 1 and 2 belong to the drift Level23East.
Tonnes:
VALUES (1, 28437.0),
(2, 13296.0),
Total: 41733
Looks good, we know we got a sum of tonnes right. Now the tonne-weighted average:
Pergium:
(1, 'Pergium g/tonne', 15.23),
(2, 'Pergium g/tonne', 4.22),
(28437 * 15.23 + 13296 * 4.22)/41733 = 11.722
It checks out. Do a few more checks and send it out to the Janus VI Pergium Mine manager.
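The same check in a few lines of Python, for the skeptical:

```python
# The two Level23East cuts from the mock data:
# (tonnes, pergium g/tonne) pairs.
cuts = [(28437.0, 15.23), (13296.0, 4.22)]
total = sum(t for t, _ in cuts)
pergium = sum(t * g for t, g in cuts) / total
print(total)              # 41733.0
print(round(pergium, 3))  # 11.722
```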
Notes:
This is a messy one-off mousetrap. That said, this is often how the sausage gets made in a non-programming, non-professional-development environment. We do have an in-house Python developer, Lori. Often she's given something like this and told to clean it up and make it into an in-house app. That's challenging. Ideally, the mining professional writing the one-off and the dev get together and cross-educate vis-à-vis the domain space (mining) and the developer space (programming, good software design and practice). It's a lot of fun, but the first go-around is seldom pretty.
Thanks for stopping by.
Leonard Nimoy
1931 - 2015