Friday, December 11, 2015

Improved Storing and Displaying Images in Postgresql - bytea

Last post I brute forced the storage of binary image (jpeg) data as text in a Postgresql database, and accordingly brute forced the data's display in the Unix image viewer feh from the output of a psql query.  It was hackish, and I received some good constructive criticism on how to improve it:

1) use Python's base64 module instead of the binascii one.

2) use bytea as a storage type in Postgresql instead of text.

Marius Gedminas made the base64.b64encode suggestion for text storage.  It does make for a little less storage space.  Ultimately we won't go with this solution because we want to go with bytea, the Postgresql data type intended for this kind of data.  But for completeness, here is what a base64.b64encode text solution would look like:

$ python3.5
Python 3.5.0 (default, Oct 23 2015, 21:23:18)
[GCC 4.2.1 20070719 ] on openbsd5
Type "help", "copyright", "credits" or "license" for more information.
>>> import base64
>>> f = open('prrrailwhaletankcar.jpg', 'rb')
>>> bindata = f.read()
>>> f.close()
>>> b64data = str(base64.b64encode(bindata))
>>> # Converted data to string for write with csv file
>>> # to database table text field.
>>> # The string representation of BASE 64 includes the
>>> # letter b and single quotes.
>>> b64data[:10]
>>> b64data[-10:]
>>> b64data[1]
>>> b64data[-1]
>>> # Isolate the BASE 64 digits with the quotes included.
>>> substrx = b64data[1:]
>>> picdata = base64.b64decode(substrx)
>>> f = open('test.jpg', 'wb')
>>> f.write(picdata)
>>> f.close()
>>> len(substrx)
>>> # BASE 64 string is 1 1/3 times as big as the
>>> # binary data it represents.
>>> _/187810
>>> # Taking off the quote marks doesn't inhibit the
>>> # decoding of the BASE64 string at all - probably
>>> # best to go with this less is more approach.
>>> subsubstrx = substrx[1:-1]
>>> picdata = base64.b64decode(subsubstrx)
>>> f = open('test2.jpg', 'wb')
>>> f.write(picdata)
>>> f.close()
>>> len(picdata)
>>> # BASE64 string ever so slightly smaller without
>>> # the quote marks (2 chars).
>>> len(subsubstrx)
>>> _/187810

>>> # Works in both cases.
>>> import os
>>> os.system('feh --geometry 400x300+200+200 test.jpg')

>>> os.system('feh --geometry 400x300+200+200 test2.jpg')


The results for both commands in the last lines (show picture with feh) look the same.

Storing the BASE 64 string in a Postgresql text column works the same as storing the hex one like I did in the last post.  The main thing to look out for is properly stripping extra characters from the Python-generated string - single quotes are OK as long as they are matched on either end of the string.  As I mentioned in the code comments above, knowing what I know now, I would strip them out prior to storing the string in a database.
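The quote problem can be sidestepped entirely by decoding the base64 bytes to an ASCII string instead of wrapping them in str().  A small sketch, not from the original session - the data here are stand-in bytes rather than a jpeg:

```python
import base64

# Stand-in bytes for jpeg data read from a file.
bindata = bytes(range(256)) * 4

# .decode('ascii') avoids the "b'...'" wrapper that str() adds,
# so there is nothing to strip before a text column insert.
b64text = base64.b64encode(bindata).decode('ascii')

# Round trips back to the original binary data.
assert base64.b64decode(b64text) == bindata

# BASE 64 is about 1 1/3 times the size of the binary.
print(len(b64text) / len(bindata))
```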

On to the Postgresql bytea storage part of the post.  Someone I respect asked me on Facebook, "Why didn't you just use bytea (for storage)?"  I had to sheepishly own up to just not being used to working with binary data (as opposed to strings) so I went with what I knew.  Shame drove me to at least attempt to do things the right way - binary storage for binary data, in this case a jpeg image.

Postgresql 9.4 uses a hex based representation (hex format) for the bytea data type by default.  It is possible to mess this up - it is covered in the doc but I didn't read it carefully enough:

If you preface your hexadecimal string with \x (single backslash) you will end up with an octal representation of your binary data (digits 0 through 7).  \\x prior to the hexadecimal string will give you what you (or at least I) want: a hexadecimal representation of your binary data on output.  The SQL string I used for processing my string data (already in the database from my work on the last blog post):

/* Postgresql SQL code */
CAST('\\x' || <hexadecimal string> AS bytea)

The || operator is for concatenation of strings (this is probably obvious to Postgresql and other database users, but MSSQL uses a + symbol, so it was a little new to me).
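The hex format is just as easy to handle on the Python side.  A quick sketch (the jpeg header bytes are only an example) of chopping the \x prefix off a psql-style hex string and decoding it:

```python
import binascii

# Simulated bytea hex format output from psql:
# a '\x' prefix followed by hex digits.
psql_output = r'\x' + 'ffd8ffe0'   # example jpeg header bytes

# Chop the 2 character '\x' prefix, then decode the hex digits.
binx = binascii.unhexlify(psql_output[2:])
assert binx == b'\xff\xd8\xff\xe0'
```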

To deal with transitioning all my text picture columns to bytea I did the following:

1) create a new set of identical tables to the ones I had in the same database with new relations identical to the old ones but with the new set of tables.

2) fill the new tables in with the new data that has all the former text columns for binary as bytea.

3) delete the old tables once the new ones are filled in.

4) rename the new tables to match the names of the old ones (how I wanted the database schema to look in the first place).

Postgresql is different than MSSQL in that the database is more its own autonomous entity that needs to be connected to other databases by some introduced mechanism.  In MSSQL, databases on the same server can reference each other in queries by default.  I started looking into the Postgresql fdw (foreign data wrapper) plugin, then realized I could do this more easily with the path I took above.

It's not necessary to post all the SQL code.  I used a psql variable in my SQL for the hexadecimal data predicate to make sure I got it right each time.  From inside psql I executed the SQL files with the \i metacommand.  Here is a snippet with the variable.

/* Postgresql SQL code to be used with
   the Postgresql psql interpreter */

/* Need this for bytea conversion
   from hex string */
\set byteaidstr '\\x'

INSERT INTO locomotiveprototypes2
    SELECT keyx,
           CAST(:'byteaidstr' || picture AS bytea)
    FROM locomotiveprototypes;

The variable thing in psql takes a little getting used to but the Postgresql documentation is good about explaining when and how to use the single quote marks and where to put them.  It worked out.

The most important part:  getting the picture to show up from a psql metacommand through the use of a Python script.  Here is my modified script, similar to the one in my last post:


"""
Processing of image coming out
of Postgresql query as a stream.

Deals with bytea column string
output from psql.
"""

import base64
import sys
import subprocess

DECODED = 'decoded'

SIZEMSG = '\nsize of {0:s} output = {1:d}\n'
SIZERATIOMSG = '\nsize of {0:s} output/size of binary output = {1:05.5f}\n'

# Want to avoid '\\x' in query output -
#     2 character offset skips it.
STARTINDEX = 2

FEHCMD = ['feh', '--geometry', '400x300+200+200', '-']

BYTEAFMT = 'bytea hex format'

# 2 variables track changes in size of
#     hex output from query in psql.
sizex = 0
lenxbin = 0

# Feeding to script straight from
# psql \copy metacommand.
inputx = sys.stdin.readline()
sizex = len(inputx)

# print's are mainly for flagging when something goes wrong.
#     aka debugging

# -1 index in slice chops off the return character '\n'
# Need casefold=True to deal with lower case from Postgresql.
binx = base64.b16decode(inputx[STARTINDEX:-1], casefold=True)
lenxbin = len(binx)

# print's highlight size relationship between
#     hex representation and actual binary data.
print(SIZEMSG.format(BYTEAFMT, sizex))
print(SIZEMSG.format(DECODED, lenxbin))
print(SIZERATIOMSG.format(BYTEAFMT, sizex/lenxbin))

# Pops up picture on screen.
subprocess.run(FEHCMD, input=binx)


An important change I made from last time is fixing the call to the image viewer feh to eliminate all that hacky intermediate writing of a jpeg file that took forever (in computer time).  It turns out feh accepts binary input from a pipe or stdin just fine - I just needed to read the man page more thoroughly.
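As a standalone illustration of that stdin piping with subprocess - cat stands in for feh here so the result can be checked programmatically; for feh you would use the FEHCMD list from the script above:

```python
import subprocess

# cat stands in for feh so the bytes can be verified;
# with feh the command would be
# ['feh', '--geometry', '400x300+200+200', '-'].
binx = b'\xff\xd8\xff\xe0'
proc = subprocess.run(['cat'], input=binx, stdout=subprocess.PIPE)
assert proc.stdout == binx
```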

Now to see if this works:

$ psql hotrains carl
Password for user carl:
psql (9.4.4)
Type "help" for help.

hotrains=# \copy (SELECT picture FROM locomotiveprototypes WHERE keyx = 3) to program ''

size of bytea hex format output = 1081720

size of decoded output = 540858

size of bytea hex format output/size of binary output = 2.00001

And we're good to go.

Thanks for stopping by.

Saturday, October 17, 2015

Storing and Displaying Images in Postgresql

Last post I set up a toy (literally) Postgresql database for my model train car collection.  A big part of the utility of the database is its ability to store images (pictures or photos) of the real life prototype and model train cars.  Postgresql (based on my google research) offers a couple methods of doing that.  I'll present how I accomplished this here.  The method I chose suited my home needs.  For a commercial or large scale project, something more efficient in the way of storage and speed of retrieval may be better.  Anyway, here goes.

I chose to store my photos as text representations of binary data in Postgresql database table columns with the text data type.  This decision was mainly based on my level of expertise and the fact that I am doing this for home use as part of a learning experience.  Storing the binary data as text inflates their size by a factor of two - very inefficient for storage.  For home use in a small database like mine, storage is hardly an issue.  At work I transfer a lot of binary data (3 dimensional mesh mined solids) to a distant server in text format using MSSQL's bcp.  Postgresql is a little different, but I am familiar with the general idea of stuffing a lot of text in a database column.
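That factor-of-two inflation is easy to demonstrate with binascii, since each byte becomes two hex characters.  A quick sketch with stand-in bytes:

```python
import binascii

datax = b'\x00\x01\xfe\xff' * 100   # stand-in for photo bytes

hexed = binascii.hexlify(datax)
# Each byte becomes 2 hex characters - a factor of 2 inflation.
assert len(hexed) == 2 * len(datax)
```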

In order to get the data into comma delimited rows without dealing with a long, unwieldy string of text from the photos, I wrote a Python script to do it:


"""
Prepare multiple rows of data
that includes a hexlify'd
picture for a column in
a table in the model train
database.
"""
import binascii
import os

UTF8 = 'utf-8'
# LATIN1 = 'latin-1'

INFOFILE = 'infoiii.csv'

PICTUREFILEFMT = '{:s}.{:s}'

# Name format for the output row file (reconstructed -
#     matches file names like 'G39Arow' used on import).
ROWFILEOUTFMT = '{:s}row'

JPG = 'jpg'
PNG = 'png'

COMMA = ','

PATHX = '/home/carl/postgresinstall/workpictures/multiplecars/'

PATHXOUT = PATHX + 'rows/'

PHOTOMSG = 'Now doing photo {:s} . . .'

def checkfileextension(basename):
    """
    With the basename of an image file
    returns True for jpg and False for
    anything else (png).
    """
    if os.path.exists(PATHX +
            PICTUREFILEFMT.format(basename, JPG)):
        return True
    return False

with open(PATHX + INFOFILE, 'r', encoding=UTF8) as finfo:
    for linex in finfo:
        newlineparts = [x.strip() for x in linex.split(COMMA)]
        photox = newlineparts.pop()
        # Check for jpg or png here
        # XXX - this could be better - could actually
        #       check and return actual extension;
        #       more code and lazy.
        extension = ''
        if checkfileextension(photox):
            extension = JPG
        else:
            extension = PNG
        print(PHOTOMSG.format(photox))
        with open(PATHX +
                PICTUREFILEFMT.format(photox,
                    extension), 'rb') as fphoto:
            contents = binascii.hexlify(fphoto.read())
            liney = COMMA.join(newlineparts)
            liney += COMMA
            liney = bytes(liney, UTF8)
            liney += contents
            with open(PATHXOUT +
                    ROWFILEOUTFMT.format(photox), 'wb') as frow:
                frow.write(liney)


The basic gist of the script is to get each photo name provided into a file that can be later imported into a table in Postgresql.  The paths in the capitalized "constants" would have to be adjusted for your situation (I tend to go overboard on capitalized constants because I'm a lousy typist and want to avoid screwing up and then having to debug my typos).  The INFOFILE referred to in the script has roughly the following format:

<column1data>, <column2data>, . . . , <photofilename>

So the idea is to take a comma delimited file, encode it in UTF-8, and stuff the binary data from the (correct) photo at the end as text.  I designed my database tables with photos (I use the column name "picture") with the text data column as the last - this is kind of a hack, but it made scripting this easier.
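Here is a minimal in-memory sketch of building one such row (the column values and photo bytes are made up for illustration):

```python
import binascii

UTF8 = 'utf-8'
COMMA = ','

# Stand-ins for one parsed INFOFILE line and a photo's bytes.
newlineparts = ['G-39A Ore Jenny', 'http://example.invalid/g39a', 'ore jenny']
photobytes = b'\xff\xd8\xff\xe0'   # made-up jpeg header bytes

# Text columns first, hexlify'd picture data last -
# matching the column order described above.
rowx = bytes(COMMA.join(newlineparts) + COMMA, UTF8)
rowx += binascii.hexlify(photobytes)
print(rowx)
```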

An example of importing one of these "row" files into the database table from within psql:

$ psql hotrains carl
Password for user carl:
psql (9.4.1)
Type "help" for help.

hotrains=# \d
                List of relations
 Schema |          Name          | Type  | Owner
--------+------------------------+-------+-------
 public | rollingstockprototypes | table | carl
(1 row)

hotrains=# \d rollingstockprototypes
     Table "public.rollingstockprototypes"
  Column  |          Type          | Modifiers
----------+------------------------+-----------
 namex    | character varying(50)  | not null
 photourl | character varying(150) | not null
 comments | text                   | not null
 picture  | text                   | not null
Indexes:
    "rsprotoname" PRIMARY KEY, btree (namex)

hotrains=# COPY rollingstockprototypes FROM '/home/carl/postgresinstall/G39Arow' (DELIMITER ',');


My Python script for actually displaying a photo or image is a little hacky in that it requires checks on the size of the output versus the size of the information pulled from the Postgresql database table.  My original script would show the picture piped to the lightweight UNIX image viewer feh as only partially complete.  In order to get around this I put a timed loop in the script to check that the image data were about half the size of the text data pulled.  It works well enough, if slowly at times:


"""
Try to mimic processing of image
coming out of postgresql query
as a stream.
"""

import binascii
import os
import time
import sys

import argparse

# Name of file containing psql \copy hex output (text).
HEXFILE = '/home/carl/postgresinstall/workpictures/hexoutput'

# 2.5 seconds max delay before abort.
MAXTIME = 2.5

# Enough time to write most big pixel
# jpg's, it appears.
PAUSEX = 0.25

# Argument name.
PICTURENAME = 'picturename'

parser = argparse.ArgumentParser()
parser.add_argument(PICTURENAME)
args = parser.parse_args()

# Name of picture file
# written from hex query.
PICNAME = args.picturename

# Extensions feh recognizes.
PNG = 'png'
JPG = 'jpg'

FILEEXTENSIONMSG = '\nFile extension {:s} detected.\n'
UNRECOGNFILENAME = '\nUnrecognized file extension for picture '

ABORTMSG = '\nSorry, no data available for feh.  Aborting.\n'

SLEEPMSG = '\nSleeping {:2.2f} seconds . . .\n'

SIZEHEXFILEMSG = '\nsize of hex output = {:d}\n'
SIZEBINARYMSG = '\nsize of binary file = {:d}\n'
SIZERATIOMSG = '\nsize of hex output/size of binary file '
SIZERATIOMSG += '{:05.5f}\n'

ABORTMSGTOOSMALL = '\nSorry, not enough data to show a '
ABORTMSGTOOSMALL += 'complete picture.  Aborting.\n'

extension = PICNAME[-3:]
if extension == PNG:
    print(FILEEXTENSIONMSG.format(PNG))
elif extension == JPG:
    print(FILEEXTENSIONMSG.format(JPG))
else:
    print(UNRECOGNFILENAME + PICNAME)
    sys.exit()

PICFILEFMT = '/home/carl/postgresinstall/workpictures/{:s}'
FEHFMT = 'feh -g 400x300+200+200 {:s}'

# Hex output should be about twice the size of
#     the binary data it represents.
ACCEPTABLEHEXTOBINRATIO = 1.99

# Length of binary string.
lenx = 0
# 2 variables track changes in size of
# hex output from query in psql.
sizex = 0
sizexnew = 0
# Tracks time spent sleeping.
totaltimewait = 0.0

while totaltimewait < MAXTIME:
    # Try to make sure hex file is completely written.
    sizexnew = os.path.getsize(HEXFILE)
    if sizexnew > sizex or sizexnew == 0:
        sizex = sizexnew
        print(SLEEPMSG.format(PAUSEX))
        time.sleep(PAUSEX)
        totaltimewait += PAUSEX
    elif sizexnew == sizex:
        with open(HEXFILE, 'rb') as f2:
            with open(PICFILEFMT.format(PICNAME), 'wb') as f:
                strx = binascii.unhexlify(f2.read().strip())
                lenx = len(strx)
                f.write(strx)
        break

if totaltimewait >= MAXTIME:
    print(ABORTMSG)
    sys.exit()

# I don't want part of a picture.
if not (lenx > 0 and
        sizexnew/lenx > ACCEPTABLEHEXTOBINRATIO):
    print(ABORTMSGTOOSMALL)
    sys.exit()

print(SIZEHEXFILEMSG.format(sizexnew))
print(SIZEBINARYMSG.format(lenx))
print(SIZERATIOMSG.format(sizexnew/lenx))

# Pops up picture on screen.
os.system(FEHFMT.format(PICFILEFMT.format(PICNAME)))


Let's see if we can get a look at this in action - example of call from within psql:

hotrains=# \copy (SELECT decode(picture, 'hex') FROM rollingstockprototypes WHERE namex = 'G-39A Ore Jenny') to program 'cat > /home/carl/postgresinstall/workpictures/hexoutput |'

And a screenshot of a (hopefully acceptable) result:

Depending on which directory I've logged into psql under, I may have to type the full paths of the output and Python file.

There is more I could do with this, but for now I'm OK with it.  Writing to a file and then checking on its size is slow.  There is probably a way to write to memory and check what's there, but I got stuck on that and decided to go with the less efficient solution.

Thanks for stopping by.

Setting Up Toy Postgresql Database on OpenBSD

This isn't a Python scripting post, but the next one will be on the same topic.  In this post I get a Postgresql database set up on my OpenBSD laptop and get familiar with the Postgresql environment.

I primarily use Microsoft SQL Server and vendor supplied database schemas at work.  I know Postgresql has a good reputation among open source databases, but I haven't had an opportunity to use it in a work environment (I had a brief brush with Jigsaw years back - a competitor to Modular's MSSQL-based Powerview (Dispatch) in pit mining truck tracking database - but that doesn't count.)

Anyway, as I've noted in previous posts, I run OpenBSD as my operating system on my laptop at home.  The OpenBSD project has a package for Postgresql.

The first order of business is to install the Postgresql server package.  First, I'll set up a PKG_PATH  FTP mirror location from within the ksh shell:

$ export PKG_PATH=

That server is the one in Boulder, Colorado - that's the one I usually use.  I'm in Tucson, Arizona in the Mountain timezone, so it kind of makes sense to use that one.  My understanding is that, in general, you want to use a mirror away from the main one to spread out the bandwidth and server use for the OpenBSD (or any other open source) project.

Now to install the package - this has to be done as root.  I use sudo for this (sudo's replacement, as I understand it, in OpenBSD 5.8 will be doas(1) although you'll still be able to get sudo(1) as a package).

$ echo $PKG_PATH

$ sudo pkg_add postgresql-server
quirks-2.54 signed on 2015-03-09T11:04:08Z
No change in quirks-2.54
postgresql-server-9.4.1p1 (extracting)
3% ********


postgresql-server-9.4.1p1 (installing)
0% useradd: Warning: home directory `/var/postgresql' doesn't exist, and -m was not specified
postgresql-server-9.4.1p1 (installing)|
3% ********



postgresql-server-9.4.1p1: ok
The following new rcscripts were installed: /etc/rc.d/postgresql
See rcctl(8) for details.
Look in /usr/local/share/doc/pkg-readmes for extra documentation.

Given an internet connection with decent speed, this all goes pretty quickly.  The first set of per cent numbers are the download of the gzipped tar package binary, the second are the unzipping and install of the Postgresql binaries in the proper location in the operating system file hierarchy.

For years I had some trouble getting my head around setting up users for Postgresql and running the daemon.  Much of my database experience is as an application user at work using Microsoft SQL Server.  We use Windows Authentication there primarily.  Working on my own UNIX-based (OpenBSD) home system is a little different.

Most of the problems I've had overcoming this user/security hump related to my lack of a good strong grasp of UNIX users and permissions (like I could do it in my sleep strong grasp).  OpenBSD is a bit unique in that it has a special name for the postgresql unprivileged user:  _postgresql.  The underscore is a convention in OpenBSD for this general class of user, usually associated with a daemon that runs on startup or gets started by root, and doesn't have a login (nor a password).  Michael Lucas spends several pages with a good summary of the rationale behind this, the history and its conventions in his authoritative OpenBSD book.

So, we want to take a look at the directory designated for Postgresql's data, /var/postgresql:

$ ls -lah /var | grep post

drwxr-xr-x   2 _postgresql  _postgresql   512B May 19 17:52 postgresql

$ cd postgresql

There is no data directory there (just . and .. in the /var/postgresql directory - the 2 in the ls output).  This is typically where I would get stuck in the past.  I ended up doing it manually . . . and wrong, or at least in a way that was more difficult than necessary.  Anyway, I recorded it that way, so I'll blog it as executed.

What I had difficulty understanding before was the whole unprivileged user concept.  Basically you need to use su to log on as root, then further su to log on as _postgresql:


$ su
# su - _postgresql
$ mkdir /var/postgresql/data
$ ls -lah /var/postgresql
total 12
drwxr-xr-x   3 _postgresql  _postgresql   512B Jun  4 19:06 .
drwxr-xr-x  23 root         wheel         512B May 19 17:52 ..
drwxr-xr-x   2 _postgresql  _postgresql   512B Jun  4 19:06 data
$ exit
# exit


Now I need a database cluster.  I want to initialize it with support for UTF-8 because I have some text data with umlauts in it (non-ASCII):

$ su
# su - _postgresql
$ initdb -D /var/postgresql/data -U postgres -A md5 -E UTF8 -W

The files belonging to this database system will be owned by user "_postgresql".
This user must also own the server process.

The database cluster will be initialized with locale "C".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/postgresql/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 30
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
creating template1 database in /var/postgresql/data/base/1 ... ok
initializing pg_authid ... ok
Enter new superuser password:
Enter it again:
setting password ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating collations ... not supported on this platform
creating conversions ... ok
creating dictionaries ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
loading PL/pgSQL server-side language ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok
copying template1 to postgres ... ok
syncing data to disk ... ok

Success. You can now start the database server using:

    postgres -D /var/postgresql/data
    pg_ctl -D /var/postgresql/data -l logfile start

$ exit
# exit
$ whoami
$ pwd

A couple things:

1) There's a line in the output about fixing permissions on the existing data directory (this will show up as highlighted on the blog, possibly not in the planetpython blog feed) - had I done this correctly (just let initdb make the directory itself), that line would look something like this (I created another cluster while writing the blog just so I would understand how to do it right):

creating directory /var/postgresql/data4 ... ok

Right there in the initdb(1) man page:  "Creating a database cluster consists of creating the directories in which the database data will live . . ."  The man page goes on to explain how to get around permission problems, etc. in this process.  Note to self:  read the man page . . . carefully.

2) What I also learned is that you can make as many database clusters as you want, all with different data directories.  postgres is the superuser name you see in the documentation and /var/postgresql/data is the directory, but, as demonstrated above in the output, you could put your data in a folder called data4.  If you gave a different name at the -U switch in the initdb command, the superuser name would be different too.  Or you could have more than one cluster with postgres named superusers but with different passwords.

All that said, one cluster per physical box and the conventional names are plenty for me - I'm just trying to get used to the Postgresql environment and get started.

At this point I need to start up the Postgresql daemon.  In the package install above, the output mentions an rc script /etc/rc.d/postgresql.  This is run by root - below is a demo of using it manually with su (instead of using it as part of an rc startup sequence at boot):

$ su

# /etc/rc.d/postgresql start
# pgrep postgres
# /etc/rc.d/postgresql stop                                                   
# pgrep postgres

All I did there was start the Postgresql daemon with the installed rc script, check to see that its associated processes are running, then stop the daemon with the same script.

Me being me, I can't leave good enough alone.  I want the control of starting and stopping the daemon when I decide to (I am running this on a laptop).  As I understand it, pg_ctl is a wrapper program provided with the Postgresql install for even more low level commands and functionality.  I use pg_ctl to run the daemon and start it with the _postgresql user account:

$ su

# su - _postgresql
$ pg_ctl -D /var/postgresql/data -l firstlog start
server starting
$ exit
# exit

I asked pg_ctl to make a specific log file for this session (firstlog - this will go in directory /var/postgresql/).  The logs are human readable and I wanted to study them later to see what's going on (there's all kinds of stuff in there about autovacuum and what not - sorry, we're not covering that in this blog post - but I'll have it available later).

Shutting down (stopping) the daemon is pretty simple with pg_ctl - just a few more keystrokes than if I had done it from root with the rc script:

$ su
# su - _postgresql
$ pg_ctl -D /var/postgresql/data stop
waiting for server to shut down.... done
server stopped
$ exit
# exit
$ whoami

Great - so I'm good for getting the daemon going when I want it and for designating my own specific log files per session.  Now to create a user and get to work:

(with daemon running):

$ psql -U postgres
Password for user postgres:
psql (9.4.1)
Type "help" for help.

postgres=# CREATE ROLE carl SUPERUSER;
postgres=# ALTER ROLE carl PASSWORD 'xxxxxxxx';
postgres=# ALTER USER carl PASSWORD 'xxxxxxxx' LOGIN;
postgres=# \q


I created a user/role carl with SUPERUSER capabilities within this instance of Postgresql.  It's a bit ugly and I'm not sure I've done this correctly or the easiest way.  Also, and of importance, I have given Postgresql user carl (not OpenBSD user carl) all permissions on everything.  Really, carl only needs permissions to work on the database he's working on.  Josh Drake (@linuxhiker on twitter) pointed this out to me.  I am grateful for this.  He is right.  I am lazy.

Now to create my database.  I got into model trains around Christmas of 2014 and went crazy collecting stuff and setting up a layout.  I needed to somehow keep track of all the cars before it all got too unwieldy.

$ psql postgres carl
Password for user carl:
psql (9.4.1)
Type "help" for help.

postgres=# CREATE DATABASE hotrains;
postgres=# \q

The command line entry to start psql is something I'm a bit fuzzy on - postgres isn't, to the best of my knowledge, a database per se, but a means of connecting to psql when you don't want to designate a default database to work on.

I'm not going to post the full database code for the sake of brevity - it's only 11 tables but that's a bit much for a blog post.  Instead I'll post a graphic schema I made and talk to it a little bit before posting one related SQL code snippet.

Disclaimer:  I'm not a designer.  This schema diagram I did with Dia, a fairly lightweight Linux/UNIX desktop tool for flowcharts and stuff.  I've never met a color palette or font choice I could simply let be.  Asking me to do a flowchart with a lot of leeway on design is like leaving a two year old home alone with a Crayola 64 pack of crayons and the 300 year old family Bible - it can't end well.

All that said, I find schema diagrams helpful for showing relationships between tables and having an ugly one is better than none at all.  I've embedded an svg version of it below; hopefully it shows up on the planetpython feed:

The focus of my crude toy database design was the use of foreign keys to maintain consistency in naming things I want to track:  rail name for example.  I went with "Santa Fe" where I could have gone with (and probably should have) "ATSF."  It doesn't matter as long as it's consistent and I know what it means.

Years ago I was called in to do some work on a blasting database at the mine.  There weren't any constraints on the entry of names of blasting materials, but what could go wrong?  There were only three or four products with four digit designators and "None."  Well . . . it was a mess.  I didn't want to take any chances on having a situation like that again, even, or especially, if I was doing all of the data entry.  Foreign keys it was!

Here's a quick dump of the code I used to create the validsidenumbers table.  The idea is to make sure the rail line or company name is consistent in all the side number records (yes, I did actually purchase some identical rolling stock with the exact same side numbers - it's a long story):

hotrains=# CREATE TABLE validsidenumbers (
railnamex        varchar(50) REFERENCES validraillines (namex),
sidenumber       varchar(50),
comments         text,
PRIMARY KEY (railnamex, sidenumber)
);

That REFERENCES keyword sees to it that I won't enter anything typo'd or goofy into that railnamex column.

Next post is a Python one about storing images of the train cars in the database and displaying them from within psql.

Thanks for stopping by.

Saturday, September 26, 2015

MSSQL sqlcmd -> bcp csv dump -> Excel

A couple months back I had a one-off assignment to dump some data from a vendor provided relational database to a csv file and then from there to Excel (essentially a fairly simple ETL - extract, transform, load exercise).  It was a little trickier than I had planned it.  Disclaimer:  this may not be the best approach, but it worked . . . at least twice . . . on two different computers and that was sufficient.


Database:  the relational database provided by the vendor is the back end to a graphic mine planning application.  It does a good job of storing geologic and mine planning data, but requires a little work to extract the data via SQL queries.

Weighted Averages:  specifically, the queries are required to do tonne-weighted averages and binning.  Two areas that I've worked in, mine planning and mineral processing (mineral processing could be considered a subset of metallurgy or chemical engineering), require a lot of work with weighted averages.  Many of the database programming examples online deal with retail and focus on sales in the form of sum of sales by location.  The weighted average by tonnes or gallons of flow requires a bit more SQL code.
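For illustration, a tonne-weighted average is just sum(tonnes * grade) / sum(tonnes).  A quick Python sketch using three of the mock cuts' tonnes and pergium grades from the data further down:

```python
# Tonne-weighted average grade:
#     sum(tonnes * grade) / sum(tonnes).
# (tonnes, pergium g/tonne) pairs for three mock cuts -
# values taken from the mock data tables below.
cuts = [(28437.0, 15.23), (13296.0, 4.22), (13222.0, 2.72)]

totaltonnes = sum(t for t, _ in cuts)
weightedavg = sum(t * g for t, g in cuts) / totaltonnes
print(round(weightedavg, 2))
```

Note how the big, high-grade first cut pulls the average well above the simple mean of the three grades - which is the whole point of weighting by tonnes.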

Breaking Up the SQL and the CSV Dump Problem:  in order to break the weighted average and any associated binning into smaller, manageable chunks of functionality, I used MSSQL (Microsoft SQL Server) global temporary tables in my queries.  Having my final result set in one of these global temporary tables allowed me to dump it to a csv file using the MSSQL bcp utility.  There are other ways to get a result set and produce a csv file from it with Python.  I wanted to isolate as much functionality within the MSSQL database as possible.  Also, the bcp utility gives some feedback when it fails - this made debugging or troubleshooting the one off script easier, for me, at least.

As far as the SQL goes, I may have been able to do this with a single query without too much trouble.  There are tools within Transact-SQL for pivoting data and doing the sort of things I naively and crudely do with temporary tables.  That said, in real life, the data are seldom this simple and this clean.  There are far more permutations and exceptions.  The real life version of this problem has fourteen temporary tables versus the four shown here.

Sanitized Mock Up Scenario:  there's no need to go into depth on our vendor's database schema or the specific technical problem - both are a tad complicated.  I like doing tonne-weighted averages with code but it's not everyone's cup of tea.  In the interest of simplifying this whole thing and making it more fun, I've based it on the old Star Trek episode "The Devil in the Dark" about an underground mine on a distant planet.

Mock Data:  we're modeling mined out areas and associated tonnages of rock bearing pergium, gold, and platinum in economic concentrations.  (I don't know what pergium is, but it was worth enough that going to war with Mother Horta seemed like a good idea).  Here is some code to create the tables and fill in the data (highly simplified schema - each mined out area is a "cut").

SQL Server 2008 R2 (Express) - table creation and mock data SQL code.  I'm not showing the autogenerated db creation code - it's lengthy - suffice it to say the database name is JanusVIPergiumMine.  Also, there are no keys in the tables for the sake of simplicity.

USE JanusVIPergiumMine;

CREATE TABLE cuts (
    cutid INT,
    cutname VARCHAR(50),
    monthx VARCHAR(30),
    yearx INT);

CREATE TABLE cutattributes (
    cutid INT,
    attributex VARCHAR(50),
    valuex VARCHAR(50));

CREATE TABLE tonnes (
    cutid INT NULL,
    tonnes FLOAT);

CREATE TABLE dbo.gradesx(
    cutid int NULL,
    gradename varchar(50) NULL,
    gradex float NULL);


DELETE FROM cuts;

INSERT INTO cuts
    VALUES (1, 'HappyPergium1', 'April', 2015),
           (2, 'HappyPergium12', 'April', 2015),
           (3, 'VaultofTomorrow1', 'April', 2015),
           (4, 'VaultofTomorrow2', 'April', 2015),
           (5, 'Children1', 'April', 2015),
           (6, 'Children2', 'April', 2015),
           (7, 'VandenbergsFind1', 'April', 2015),
           (8, 'VandenbergsFind2', 'April', 2015);

DELETE FROM cutattributes;

INSERT INTO cutattributes
    VALUES (1, 'Drift', 'Level23East'),
           (2, 'Drift', 'Level23East'),
           (3, 'Drift', 'Level23West'),
           (4, 'Drift', 'Level23West'),
           (5, 'Drift', 'BabyHortasCutEast'),
           (6, 'Drift', 'BabyHortasCutEast'),
           (7, 'Drift', 'BabyHortasCutWest'),
           (8, 'Drift', 'BabyHortasCutWest');


DELETE FROM tonnes;

INSERT INTO tonnes
    VALUES (1, 28437.0),
           (2, 13296.0),
           (3, 13222.0),
           (4, 6473.0),
           (5, 6744.0),
           (6, 8729.0),
           (7, 10030.0),
           (8, 2345.0);

DELETE FROM gradesx;

INSERT INTO gradesx
    VALUES (1, 'Au g/tonne', 6.44),
           (1, 'Pt g/tonne', 0.54),
           (1, 'Pergium g/tonne', 15.23),
           (2, 'Au g/tonne', 7.83),
           (2, 'Pt g/tonne', 0.77),
           (2, 'Pergium g/tonne', 4.22),
           (3, 'Au g/tonne', 0.44),
           (3, 'Pt g/tonne', 3.54),
           (3, 'Pergium g/tonne', 2.72),
           (4, 'Au g/tonne', 0.87),
           (4, 'Pt g/tonne', 2.87),
           (4, 'Pergium g/tonne', 1.11),
           (5, 'Au g/tonne', 12.03),
           (5, 'Pt g/tonne', 0.33),
           (5, 'Pergium g/tonne', 10.01),
           (6, 'Au g/tonne', 8.72),
           (6, 'Pt g/tonne', 1.38),
           (6, 'Pergium g/tonne', 5.44),
           (7, 'Au g/tonne', 7.37),
           (7, 'Pt g/tonne', 1.59),
           (7, 'Pergium g/tonne', 4.05),
           (8, 'Au g/tonne', 3.33),
           (8, 'Pt g/tonne', 0.98),
           (8, 'Pergium g/tonne', 3.99);

Python Code to Run the Dump/ETL to CSV:  this is essentially a series of os.system calls to MSSQL's sqlcmd and bcp.  What made this particularly brittle and hairy is the manner in which the lifetime of temporary tables is determined in MSSQL.  To get the temporary table with my results to persist, I had to wrap its creation inside a process.  I'm ignorant as to the internal workings of buffers and memory here, but the MSSQL sqlcmd commands do not execute or write to disk exactly when you might expect them to.  Nothing is really completed until the process hosting sqlcmd is killed.
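The pattern - keep one long-lived process open, feed it commands on stdin, and let everything complete when stdin closes and the process dies - can be illustrated with a stand-in child process.  Here a child Python interpreter plays the role of sqlcmd (purely an analogy, not the actual dump code):

```python
import subprocess
import sys

# Stand-in for sqlcmd: a child Python interpreter reading its
# "script" from stdin.  Nothing executes until stdin is closed
# and the child runs to completion - the same lifetime behavior
# that forced wrapping the whole sqlcmd session in one process.
child = subprocess.Popen([sys.executable, '-'],
                         stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE)
child.stdin.write(b'x = 41733\n')
child.stdin.write(b'print(x)\n')
out, _ = child.communicate()  # closes stdin, waits for exit
print(out.decode().strip())
```

With sqlcmd the stakes are higher:  the global temp tables vanish when the hosting connection goes away, so the create, fill, and bcp steps all have to ride inside that one process.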

At work I actually got the bcp format file generated on the fly - I wasn't able to reproduce this behavior for this mock exercise.  Instead, I generated a bcp format file for the target table dump "by hand" and put the file in my working directory.

As I show further on, this SQL data dump will be run from a button within an Excel spreadsheet.

Mr. Spock, or better said, Horta Mother says it best:

Subprocesses, sqlcmd, bcp, Excel . . .




# Changed my laptop's name to MYLAPTOP.
# Yours will be whatever your computer
# name is.

import os
import subprocess as subx
import shlex
import time
import argparse

# Need to make sure you are in proper Windows directory.
# Can vary from machine to machine based on
# environment variables.
# Googled StackOverflow.
# 5137497/find-current-directory-and-files-directory
EXCELDIR = os.path.dirname(os.path.realpath(__file__))
print('\nCurrent directory is {:s}'.format(os.getcwd()))

parser = argparse.ArgumentParser()
# 7 digit argument like 'Apr2015'
# Feed in at command line.
parser.add_argument('monthyear',
    help='seven digit, month abbreviation (Apr2015)',
    type=str)
args = parser.parse_args()
MONTHYEAR = args.monthyear

# Use Peoplesoft/company id so that more than
# one user can run this at once if necessary
# (note:  will not work if one user tries to
#         run multiple instances at the same
#         time - theoretically <not tested>
#         tables will get mangled and data
#         will be corrupt.)
USER = os.getlogin()

CSVDUMPNAME = 'csvdumpname'
CSVDUMP = 'nohandjamovnumbersbcp'
CSVEXT = '.csv'
HOMESERVERNAME = 'homeservername'
USERNAME = 'username'

# Need to fill in month, year
# with input from Excel spreadsheet.
QUERYDICT = {'month':"'{:s}'",
             'year':0,
             'username':USER}

# For sqlcmd and bcp
ERRORFILENAME = 'errorfilename'
STDOUTFILENAME = 'stdoutfilename'
ERRX = 'sqlcmderroutput.txt'
STDOUTX = 'sqcmdoutput.txt'
EXIT = '\nexit\n'
UTF8 = 'utf-8'
GOX = '\nGO\n'

# 2 second pause.
SLEEPING = '\nsleeping {pause:d} seconds . . .\n'

# XXX - Had to generate this bcp format file
#       from table in MSSQL Management Studio -
#       dos command line:
# bcp ##TARGETX format nul -f test.fmt -S MYLAPTOP\SQLEXPRESS -t , -c -T

# XXX - you can programmatically extract
#       column names from the bcp format
#       file or
#       you can dump them from SQLServer
#       with a separate query in bcp -
#       I have done neither here
#       (I hardcoded them).
FMTFILE = 'formatfile'

# XXX - command line values keyed on the constants above;
#       test.fmt is the hand-generated bcp format file
#       mentioned earlier.
CMDLINEDICT = {ERRORFILENAME:ERRX,
               STDOUTFILENAME:STDOUTX,
               'go':GOX,
               'exit':EXIT,
               CSVDUMPNAME:CSVDUMP,
               FMTFILE:'test.fmt',
               HOMESERVERNAME:r'MYLAPTOP\SQLEXPRESS',
               USERNAME:USER}


# Startup for sqlcmd interactive mode.
SQLPATH = r'C:\Program Files\Microsoft SQL Server'
SQLPATH += r'\100\Tools\Binn\SQLCMD.exe'
# Argument list for Popen (shlex.split mangles the
# backslashes in Windows paths, so build the list directly).
SQLCMDEXE = [SQLPATH, '-S', r'MYLAPTOP\SQLEXPRESS']
BCPSTR = ':!!bcp "SELECT * FROM ##TARGETX{username:s};" '
BCPSTR += 'queryout {csvdumpname:s}.csv -t , '
BCPSTR += '-f {formatfile:s} -S {homeservername:s} -T'

def cleanslate():
    """
    Delete files from previous runs.
    """
    # XXX - only one file right now.
    files = [CSVDUMP + CSVEXT]
    for filex in files:
        if os.path.exists(filex) and os.path.isfile(filex):
            os.remove(filex)
    return 0

MONTHS = {'Jan':'January',
          'Feb':'February',
          'Mar':'March',
          'Apr':'April',
          'May':'May',
          'Jun':'June',
          'Jul':'July',
          'Aug':'August',
          'Sep':'September',
          'Oct':'October',
          'Nov':'November',
          'Dec':'December'}

def parseworkbookname():
    """
    Get month (string) and year (integer)
    from name of workbook (Apr2015).

    Return as month, year 2 tuple.
    """
    # XXX
    # Write this out - will eventually
    # need error checking/try-catch.
    monthx = MONTHS[MONTHYEAR[:3]]
    yearx = int(MONTHYEAR[3:])
    return monthx, yearx

# Global Temporary Tables
CREATE TABLE ##TONNES{username:s} (
    yearx INT,
    monthx VARCHAR(30),
    cutid INTEGER,
    drift VARCHAR(30),
    tonnes FLOAT);

USE JanusVIPergiumMine;

DECLARE @DRIFT CHAR(5) = 'Drift';

INSERT INTO ##TONNES{username:s}
    SELECT cutx.yearx,
           cutx.monthx,
           cutx.cutid,
           cutattrx.valuex AS drift,
           tonnesx.tonnes
    FROM cuts cutx
        INNER JOIN cutattributes cutattrx
            ON cutx.cutid = cutattrx.cutid
        INNER JOIN tonnes tonnesx
            ON cutx.cutid = tonnesx.cutid
    WHERE cutx.yearx = {year:d} AND
          cutx.monthx = {month:s} AND
          cutattrx.attributex = @DRIFT;

CREATE TABLE ##GRADES{username:s} (
    cutid INTEGER,
    drift VARCHAR(30),
    gradenamex VARCHAR(50),
    graden FLOAT);


USE JanusVIPergiumMine;

DECLARE @DRIFT CHAR(5) = 'Drift';

INSERT INTO ##GRADES{username:s}
    SELECT cutx.cutid,
           cutattrx.valuex AS drift,
           gradesx.gradename,
           gradesx.gradex
    FROM cuts cutx
        INNER JOIN cutattributes cutattrx
            ON cutx.cutid = cutattrx.cutid
        INNER JOIN gradesx
            ON cutx.cutid = gradesx.cutid
    WHERE cutx.yearx = {year:d} AND
          cutx.monthx = {month:s} AND
          cutattrx.attributex = @DRIFT;

# Sum and tonne-weighted averages.
CREATE TABLE ##MONTHLYPRODDATASET{username:s} (
    yearx INT,
    monthx VARCHAR(30),
    drift VARCHAR(30),
    tonnes FLOAT,
    gradename VARCHAR(50),
    grade FLOAT);

INSERT INTO ##MONTHLYPRODDATASET{username:s}
    SELECT tonnesx.yearx,
           tonnesx.monthx,
           tonnesx.drift,
           SUM(tonnesx.tonnes) AS tonnes,
           gradesx.gradenamex AS gradename,
           SUM(tonnesx.tonnes * gradesx.graden)/
           SUM(tonnesx.tonnes) AS graden
    FROM ##TONNES{username:s} tonnesx
        INNER JOIN ##GRADES{username:s} gradesx
            ON tonnesx.cutid = gradesx.cutid
    GROUP BY tonnesx.yearx,
             tonnesx.monthx,
             tonnesx.drift,
             gradesx.gradenamex;

# Pivot
CREATE TABLE ##TARGETX{username:s} (
    yearx INT,
    monthx VARCHAR(30),
    drift VARCHAR(30),
    tonnes FLOAT,
    pergium FLOAT,
    Au FLOAT,
    Pt FLOAT);

DECLARE @PERGIUM CHAR(15) = 'Pergium g/tonne';
DECLARE @GOLD CHAR(10) = 'Au g/tonne';
DECLARE @PLATINUM CHAR(10) = 'Pt g/tonne';
INSERT INTO ##TARGETX{username:s}
    SELECT mpds.yearx,
           mpds.monthx,
           mpds.drift,
           MAX(mpds.tonnes) AS tonnes,
           MAX(perg.grade) AS pergium,
           MAX(au.grade) AS Au,
           MAX(pt.grade) AS Pt
    FROM ##MONTHLYPRODDATASET{username:s} mpds
        INNER JOIN ##MONTHLYPRODDATASET{username:s} perg
            ON perg.drift = mpds.drift AND
            perg.gradename = @PERGIUM
        INNER JOIN ##MONTHLYPRODDATASET{username:s} au
            ON au.drift = mpds.drift AND
            au.gradename = @GOLD
        INNER JOIN ##MONTHLYPRODDATASET{username:s} pt
            ON pt.drift = mpds.drift AND
            pt.gradename = @PLATINUM
    GROUP BY mpds.yearx,
             mpds.monthx,
             mpds.drift
    ORDER BY mpds.drift;

# 1) Create global temp tables.
# 2) Fill global temp tables.
# 3) Get desired result set into the target global temp table.
# 4) Run bcp against target global temp table.
# 5) Drop global temp tables.

def getdataincsvformat():
    """
    Retrieve data from MSSQL server.

    Dump into csv text file.
    """
    numtables = len(CREATETABLES)
    with open('{errorfilename:s}'.format(**CMDLINEDICT), 'w') as e:
        with open('{stdoutfilename:s}'.format(**CMDLINEDICT), 'w') as f:
            sqlcmdproc = subx.Popen(SQLCMDEXE, stdin=subx.PIPE,
                    stdout=f, stderr=e)
            for i in range(numtables):
                cmdx = (CREATETABLES[i + 1]).format(**QUERYDICT)
                sqlcmdproc.stdin.write(bytes(cmdx +
                    '{go:s}'.format(**CMDLINEDICT), UTF8))
            for i in range(numtables):
                cmdx = (FILLTABLES[i + 1]).format(**QUERYDICT)
                sqlcmdproc.stdin.write(bytes(cmdx +
                    '{go:s}'.format(**CMDLINEDICT), UTF8))
            print('bcp csv dump command (from inside sqlcmd) . . .')
            sqlcmdproc.stdin.write(bytes(BCPSTR.format(**CMDLINEDICT), UTF8))
            sqlcmdproc.stdin.write(bytes('{exit:s}'.format(**CMDLINEDICT), UTF8))
    return 0
monthx, yearx = parseworkbookname()

# Get rid of previous files.
print('\ndeleting files from previous runs . . .\n')
cleanslate()

# Get month and year into query dictionary.
QUERYDICT['month'] = QUERYDICT['month'].format(monthx)
QUERYDICT['year'] = yearx

getdataincsvformat()



It's ugly, but it works.

Keeping with the Horta theme, this would be a good spot for an image break:

Damnit, Jim, I'm a geologist not a database programmer.

You're an analyst, analyze.

Load to Excel:  this is fairly straightforward - COM programming with Mark Hammond and company's venerable win32com.  The only working version of the win32com library on the laptop I'm writing this blog entry on came with a Python 2.5 release bundled with an old version of our mine planning software (MineSight/Hexagon) - the show must go on!



# mpython is Python 2.5 on this machine.
# Had to remove collections.namedtuple
# (used dictionary instead) and new
# string formatting (reverted to use
# of the percent operator for string
# interpolation).

# Lastly, did not have argparse at my
# disposal.

from __future__ import with_statement

"""
Get numbers into spreadsheet
without having to hand jam
them in.
"""
import os
from win32com.client import Dispatch

# Plan on receiving Excel file's
# path from call from Excel workbook.

import sys

# Path to Excel workbook.
WB = sys.argv[1]
# Worksheet name.
WSNAME = sys.argv[2]


# Looking for data file in current directory.
# (same directory as Python script)
CSVDUMP = 'nohandjamovnumbersbcp.csv'

# XXX - repeated code from data dump file.
CURDIR = os.path.dirname(os.path.realpath(__file__))
print('\nCurrent directory is %s' % os.getcwd())

# XXX - I think there's a more elegant way to
#       do this path concatenation with os.path.
CSVPATH = CURDIR + '\\' + CSVDUMP

# Fields in csv dump.
YEARX = 'yearx'
MONTHX = 'monthx'
DRIFT = 'drift'
TONNES = 'tonnes'
PERGIUM = 'pergium'
GOLD = 'Au'
PLATINUM = 'Pt'

FIELDS = [YEARX, MONTHX, DRIFT, TONNES, PERGIUM, GOLD, PLATINUM]
# Excel cells.
# Map this to csv dump and brute force cycle to fill in.
ROWCOL = '%s%d'

COLUMNMAP = dict((namex, colx) for namex, colx in
        zip(FIELDS, ['A', 'B', 'C', 'D',
            'E', 'F', 'G']))

EXCELX = 'Excel.Application'

def getcsvdata():
    """
    Puts csv data (CMP dump) into
    a list of data structures
    and returns list.
    """
    with open(CSVPATH, 'r') as f:
        records = []
        for linex in f:
            # XXX - print for debugging/information.
            print([n.strip() for n in linex.split(',')])
            records.append(dict(zip(FIELDS,
                (n.strip() for n
                    in linex.split(',')))))
    return records

# Put Excel stuff here.
def getworkbook(workbooks):
    """
    Get handle to desired workbook.
    """
    for x in workbooks:
        if x.FullName == WB:
            # XXX - debug/information print statement.
            print(x.FullName)
            break
    return x

def fillinspreadsheet(records):
    """
    Fill in numbers in spreadsheet.

    Side effect function.

    records is a list of dictionaries.
    """
    excelx = Dispatch(EXCELX)
    wb = getworkbook(excelx.Workbooks)
    ws = wb.Worksheets.Item(WSNAME)
    # Start entering data at row 4.
    row = 4
    for recordx in records:
        for x in FIELDS:
            column = COLUMNMAP[x]
            valuex = recordx[x]
            cellx = ws.Range(ROWCOL % (column, row))
            # Selection makes pasting of new value visible.
            # I like this - not everyone does.  YMMV.
            cellx.Select()
            cellx.Value = valuex
        # On to the next record on the next row.
        row += 1
    # Come back to origin of worksheet at end.
    ws.Range('A1').Select()
    return 0
cmprecords = getcsvdata()
fillinspreadsheet(cmprecords)


On to the VBA code inside the Excel spreadsheet (macros) that execute the Python code:

Option Explicit

Const EXECX = "C:\Python34\python "
Const EXECXII = "C:\MineSight\mpython\python\2.5\python "
Const SQLSCRIPT = " "
Const EXCELSCRIPT = " "

Sub FillInNumbers()

    Dim namex As String
    Dim wb As Workbook
    Dim ws As Worksheet
    Dim longexecstr As String
    Set ws = Selection.Worksheet
    'Try to get current worksheet name to feed values to query.
    namex = ws.Name
    longexecstr = EXECXII & " " & ActiveWorkbook.Path
    longexecstr = longexecstr & Chr(92) & EXCELSCRIPT
    longexecstr = longexecstr & " " & ActiveWorkbook.Path & Chr(92) & ActiveWorkbook.Name
    longexecstr = longexecstr & " " & namex

    VBA.Interaction.Shell longexecstr, vbNormalFocus
End Sub

Sub GetSQLData()
    Dim namex As String
    Dim ws As Worksheet
    Set ws = Selection.Worksheet
    'Try to get current worksheet name to feed values to query.
    namex = ws.Name

    VBA.Interaction.Shell EXECX & ActiveWorkbook.Path & _
        Chr(92) & SQLSCRIPT & namex, vbNormalFocus
End Sub

I always use Option Explicit in my VBA code - that's not particularly pythonic, but being pythonic inside the VBA interpreter can be hazardous.  As always, YMMV.

Lastly, a rough demo and a data check.  We'll run the SQL dump from the top button on the Excel worksheet:

And now we'll run the lower button to put the data into the spreadsheet.  It's probably worth noting here that I did not bother doing any type conversions on the text coming out of the SQL csv dump in my Python code.  That's because Excel handles that for you.  It's not free software (Excel/Office) - might as well get your money's worth.

We'll do a check on the first row for tonnes and a pergium grade.  Going back to our original data:

Cuts 1 and 2 belong to the drift Level23East.


VALUES (1, 28437.0),
       (2, 13296.0),

Total:  41733

Looks good, we know we got a sum of tonnes right.  Now the tonne-weighted average:


(1, 'Pergium g/tonne', 15.23),
(2, 'Pergium g/tonne', 4.22),

(28437 * 15.23 + 13296 * 4.22)/41733 = 11.722
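The same check in a few lines of Python, using the tonnes and pergium grades for cuts 1 and 2 straight from the mock data above:

```python
# Tonnes and pergium grades for cuts 1 and 2 (drift Level23East).
tonnes = [28437.0, 13296.0]
grades = [15.23, 4.22]

total = sum(tonnes)
# Tonne-weighted average: sum of tonnes * grade over sum of tonnes.
wavg = sum(t * g for t, g in zip(tonnes, grades)) / total

print(total)           # 41733.0
print(round(wavg, 3))  # 11.722
```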

It checks out.  Do a few more checks and send it out to the Janus VI Pergium Mine's manager.


This is a messy one-off mousetrap.  That said, this is often how the sausage gets made in a non-programming, non-professional development environment.  We do have an in-house Python developer, Lori.  Often she's given something like this and told to clean it up and make it into an in-house app.  That's challenging.  Ideally, the mining professional writing the one-off and the dev get together and cross-educate vis-à-vis the domain space (mining) and the developer space (programming, good software design and practice).  It's a lot of fun, but the first go-around is seldom pretty.

Thanks for stopping by.

Leonard Nimoy
1931 - 2015