Saturday, October 17, 2015

Storing and Displaying Images in Postgresql

Last post I set up a toy (literally) Postgresql database for my model train car collection.  A big part of the utility of the database is its ability to store images (pictures or photos) of the real life prototype and model train cars.  Postgresql (based on my google research) offers a couple methods of doing that.  I'll present how I accomplished this here.  The method I chose suited my home needs.  For a commercial or large scale project, something more efficient in the way of storage and speed of retrieval may be better.  Anyway, here goes.

I chose to store my photos as text representations of binary data in Postgresql database table columns with the text data type.  This decision was mainly based on my level of expertise and the fact that I am doing this for home use as part of a learning experience.  Storing the binary data as text inflates their size by a factor of two - very inefficient for storage.  For home use in a small database like mine, storage is hardly an issue.  At work I transfer a lot of binary data (3 dimensional mesh mined solids) to a distant server in text format using MSSQL's bcp.  Postgresql is a little different, but I am familiar with the general idea of stuffing a lot of text in a database column.

In order to get the data into comma delimited rows without dealing with a long, unwieldy string of text from the photos, I wrote a Python script to do it:

#!python3.4

"""
Prepare multiple rows of data
that includes a hexlify'd
picture for a column in
a table in the model train
database.
"""

import binascii
import os

UTF8 = 'utf-8'
# LATIN1 = 'latin-1'

INFOFILE = 'infoiii.csv'

PICTUREFILEFMT = '{:s}.{:s}'
ROWFILEOUTFMT = '{:s}row'

JPG = 'jpg'
PNG = 'png'

COMMA = ','

PATHX = '/home/carl/postgresinstall/workpictures/multiplecars/'

PATHXOUT = PATHX + 'rows/'

PHOTOMSG = 'Now doing photo {:s} . . .'

def checkfileextension(basename):
    """
    With the basename of an image file
    returns True for jpg and false for
    anything else (png).
    """
    if os.path.exists(PATHX +

        PICTUREFILEFMT.format(basename, JPG)):
        return True
    else:
        return False

with open(PATHX + INFOFILE, 'r', encoding=UTF8) as finfo:
    for linex in finfo:
        newlineparts = [x.strip() for x in linex.split(COMMA)]
        photox = newlineparts.pop()
        print(PHOTOMSG.format(photox))
        # Check for jpg or png here
        # XXX - this could be better - could actually
        #       check and return actual extension;
        #       more code and lazy.
        extension = ''
        if checkfileextension(photox):
            extension = JPG
        else:
            extension = PNG
        with open(PATHX +

            PICTUREFILEFMT.format(photox,
                extension), 'rb') as fphoto:
            contents = binascii.hexlify(fphoto.read())
            liney = COMMA.join(newlineparts)
            liney += COMMA
            liney = bytes(liney, UTF8)
            liney += contents
            with open(PATHXOUT +

                ROWFILEOUTFMT.format(photox), 'wb') as frow:
                frow.write(liney)

print('\nDone\n')


The basic gist of the script is to get each photo name provided into a file that can be later imported into a table in Postgresql.  The paths in the capitalized "constants" would have to be adjusted for your situation (I tend to go overboard on capitalized constants because I'm a lousy typist and want to avoid screwing up and then having to debug my typos).  The INFOFILE referred to in the script has roughly the following format:

<column1data>, <column2data>, . . . , <photofilename>

So the idea is to take a comma delimited file, encode it in UTF-8, and stuff the binary data from the (correct) photo at the end as text.  I designed my database tables with photos (I use the column name "picture") with the text data column as the last - this is kind of a hack, but it made scripting this easier.

An example of importing one of these "row" files into the database table from within psql:

$ psql hotrains carl
Password for user carl:
psql (9.4.1)
Type "help" for help.

hotrains=# \d
                List of relations
 Schema |          Name          | Type  | Owner
--------+------------------------+-------+-------
 public | rollingstockprototypes | table | carl
(1 row)

hotrains=# \d rollingstockprototypes
     Table "public.rollingstockprototypes"
  Column  |          Type          | Modifiers
----------+------------------------+-----------
 namex    | character varying(50)  | not null
 photourl | character varying(150) | not null
 comments | text                   | not null
 picture  | text                   | not null
Indexes:
    "rsprotoname" PRIMARY KEY, btree (namex)

hotrains=# COPY rollingstockprototypes FROM '/home/carl/postgresinstall/G39Arow' (DELIMITER ',');

COPY 1


My Python script for actually displaying a photo or image is a little hacky in that in requires checks for the size of the output versus the size of the information pulled from the Postgresql database table.  My original script would show the picture piped to the lightweight UNIX image viewer feh as partially complete.  In order to get around this I put a timed loop in the script to check that the image data were about half of the size of the text data pulled.  It works well enough, if slowly at times:

#!/usr/local/bin/python3.4

"""
Try to mimic processing of image
coming out of postgresql query
as a stream.
"""

import binascii
import os
import time
import sys

import argparse

# Name of file containing psql \copy hex output (text).
HEXFILE = '/home/carl/postgresinstall/workpictures/hexoutput'

# 2.5 seconds max delay before abort.
# Enough time to write most big pixel
# jpg's, it appears.
MAXTIME = 2.5
PAUSEX = 0.25

# Argument name.
PICTURENAME = 'picturename'

parser = argparse.ArgumentParser()
parser.add_argument(PICTURENAME)
args = parser.parse_args()
print(args.picturename)

# Name of picture file
# written from hex query.
PICNAME = args.picturename

# Extensions feh recognizes.
PNG = 'png'
JPG = 'jpg'

FILEEXTENSIONMSG = '\nFile extension {:s} detected.\n'
UNRECOGNFILENAME = '\nUnrecognized file extension for picture '

UNRECOGNFILENAME += '{:s}\n'
ABORTMSG = '\nSorry, no data available for feh.  Aborting.\n'

SLEEPMSG = '\nSleeping {:2.2f} seconds . . .\n'

SIZEHEXFILEMSG = '\nsize of hex output = {:d}\n'
SIZEBINARYMSG = '\nsize of binary file = {:d}\n'
SIZERATIOMSG = '\nsize of hex output/size of binary file '

SIZERATIOMSG += '{:05.5f}\n'

ACCEPTABLEHEXTOBINRATIO = 1.99
ABORTMSGTOOSMALL = '\nSorry, not enough data to show a '

ABORTINGTOOSMALL += 'complete picture.  Aborting.\n'

extension = PICNAME[-3:]
if extension == PNG:
    print(FILEEXTENSIONMSG.format(PNG))
elif extension == JPG:
    print(FILEEXTENSIONMSG.format(JPG))
else:
    print(UNRECOGNFILENAME.format(extension))
    print(ABORTMSG)
    sys.exit()

PICFILEFMT = '/home/carl/postgresinstall/workpictures/{:s}'
FEHFMT = 'feh -g 400x300+200+200 {:s}'

# Length of binary string.
lenx = 0
# 2 variables track changes in size of
# hex output from query in psql.
sizex = 0
sizexnew = 0
# Tracks time spent sleeping.
totaltimewait = 0.0

while totaltimewait < MAXTIME:
    # Try to make sure hex file is completely written.
    sizexnew = os.path.getsize(HEXFILE)
    if sizexnew > sizex or sizexnew == 0:
        sizex = sizexnew
        print(SLEEPMSG.format(PAUSEX))
        time.sleep(PAUSEX)
        totaltimewait += PAUSEX
    elif sizexnew == sizex:
        with open(HEXFILE, 'rb') as f2:
            with open(PICFILEFMT.format(PICNAME), 'wb') as f:
                strx = binascii.unhexlify(f2.read().strip())
                lenx = len(strx)
                print(SIZEHEXFILEMSG.format(sizexnew))
                print(SIZEBINARYMSG.format(lenx))
                print(SIZERATIOMSG.format(sizexnew/lenx))
                f.write(strx)
        break

# I don't want part of a picture.
if not (sizexnew > 0 and
        sizexnew/lenx > ACCEPTABLEHEXTOBINRATIO):
    print(ABORTMSGTOOSMALL)
    sys.exit()

# Pops up picture on screen.
os.system(FEHFMT.format(PICFILEFMT.format(PICNAME)))

print('\nDone\n')


Let's see if we can get a look at this in action - example of call from within psql:

hotrains=# \copy (SELECT decode(picture, 'hex') FROM rollingstockprototypes WHERE namex = 'G-39A Ore Jenny') to program 'cat > /home/carl/postgresinstall/workpictures/hexoutput | imageshowiii.py'
COPY 1
hotrains=#


And a screenshot of a (hopefully acceptable) result:


Depending on which directory I've logged into psql under, I may have to type the full paths of the output and Python file.

There is more I could do with this, but for now I'm OK with it.  Writing to a file and then checking on its size is slow.  There is probably a way to write to memory and check what's there, but I got stuck on that and decided to go with the less efficient solution.

Thanks for stopping by.

Setting Up Toy Postgresql Database on OpenBSD

This isn't a Python scripting post, but the next one will be on the same topic.  In this post I get a Postgresql database set up on my OpenBSD laptop and get familiar with the Postgresql environment.

I primarily use Microsoft SQL Server and vendor supplied database schemas at work.  I know Postgresql has a good reputation among open source databases, but I haven't had an opportunity to use it in a work environment (I had a brief brush with Jigsaw years back - a competitor to Modular's MSSQL-based Powerview (Dispatch) in pit mining truck tracking database - but that doesn't count.)

Anyway, as I've noted in previous posts, I run OpenBSD as my operating system on my laptop at home.  The OpenBSD project has a package for Postgresql.

The first order of business is to install the Postgresql server package.  First, I'll set up a PKG_PATH  FTP mirror location from within the ksh shell:

$ export PKG_PATH=ftp://ftp3.usa.openbsd.org/pub/OpenBSD/5.7/packages/i386/

That ftp3.usa.openbsd.org server is the one in Boulder, Colorado - that's the one I usually use.  I'm in Tucson, Arizona in the Mountain timezone, so it kind of makes sense to use that one.  My understanding is that, in general, you want to use a mirror away from the main one to spread out the bandwidth and server use for the OpenBSD (or any other open source) project.

Now to install the package - this has to be done as root.  I use sudo for this (sudo's replacement, as I understand it, in OpenBSD 5.8 will be doas(1) although you'll still be able to get sudo(1) as a package).

$ echo $PKG_PATH 
 ftp://ftp3.usa.openbsd.org/pub/OpenBSD/5.7/packages/i386/

$ sudo pkg_add postgresql-server
quirks-2.54 signed on 2015-03-09T11:04:08Z
No change in quirks-2.54
postgresql-server-9.4.1p1 (extracting)
1%
1%
2%
3% ********


<etc.>

100%
postgresql-server-9.4.1p1 (installing)
0% useradd: Warning: home directory `/var/postgresql' doesn't exist, and -m was not specified
postgresql-server-9.4.1p1 (installing)|
1%
1%
2%
3% ********


<etc.> 

100%

postgresql-server-9.4.1p1: ok
The following new rcscripts were installed: /etc/rc.d/postgresql
See rcctl(8) for details.
Look in /usr/local/share/doc/pkg-readmes for extra documentation.
$


Given an internet connection with decent speed, this all goes pretty quickly.  The first set of per cent numbers are the download of the gzippped tar package binary, the second are the unzipping and install of the Postgresql binaries in the proper location in the operating system file hierarchy.

For years I had some trouble getting my head around setting up users for Postgresql and running the daemon.  Much of my database experience is as an application user at work using Microsoft SQL Server.  We use Windows Authentication there primarily.  Working on my own UNIX-based (OpenBSD) home system is a little different.

Most of the problems I've had overcoming this user/security hump related to my lack of a good strong grasp of UNIX users and permissions (like I could do it in my sleep strong grasp).  OpenBSD is a bit unique in that it has a special name for the postgresql unprivileged user:  _postgresql.  The underscore is a convention in OpenBSD for this general class of user, usually associated with a daemon that runs on startup or gets started by root, doesn't have a login (nor a password).  Michael Lucas spends several pages with a good summary of the rational behind this, the history and its conventions in his authoritative OpenBSD book.




So, we want to take a look at the directory designated for Postgresql's data, /var/postgresql:

$ ls -lah /var | grep post


drwxr-xr-x   2 _postgresql  _postgresql   512B May 19 17:52 postgresql

$ cd postgresql

There is no data directory there (just . and .. in the /var/postgresql directory - the 2 in the ls output).  This is typically where I would get stuck in the past.  I ended up doing it manually . . . and wrong, or at least in a way that was more difficult than necessary.  Anyway, I recorded it that way, so I'll blog it as executed.

What I had difficulty understanding before was the whole unprivileged user concept.  Basically you need to use su to log on as root, then further su to log on as _postgresql:

# THIS IS AN UNNECESSARY STEP - DON'T DO THIS

$ su
Password:
# su - _postgresql
$ mkdir /var/postgresql/data
$ ls -lah /var/postgresql
total 12
drwxr-xr-x   3 _postgresql  _postgresql   512B Jun  4 19:06 .
drwxr-xr-x  23 root         wheel         512B May 19 17:52 ..
drwxr-xr-x   2 _postgresql  _postgresql   512B Jun  4 19:06 data
$ exit
# exit
$


# END UNNECESSARY STEP

Now I need a database cluster.  I want to initialize it with support for UTF-8 because I have some text data with umlauts in it (non-ASCII):

$ su
Password:
# su - _postgresql
$ initdb -D /var/postgresql/data -U postgres -A md5 -E UTF8 -W


The files belonging to this database system will be owned by user "_postgresql".
This user must also own the server process.

The database cluster will be initialized with locale "C".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/postgresql/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 30
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
creating template1 database in /var/postgresql/data/base/1 ... ok
initializing pg_authid ... ok
Enter new superuser password:
Enter it again:
setting password ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating collations ... not supported on this platform
creating conversions ... ok
creating dictionaries ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
loading PL/pgSQL server-side language ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok
copying template1 to postgres ... ok
syncing data to disk ... ok

Success. You can now start the database server using:

    postgres -D /var/postgresql/data
or
    pg_ctl -D /var/postgresql/data -l logfile start

$ exit
# exit
$ whoami
carl
$ pwd
/home/carl


A couple things:

1) There's a line in the output about fixing permissions on the existing data directory (this will show up as highlighted on the blog, possibly not in the planetpython blog feed) - had I done this correctly (just let initdb make the directory itself), that line would look something like this (I created another cluster while writing the blog just so I would understand how to do it right):

creating directory /var/postgresql/data4 ... ok


Right there in the initdb(1) man page:  "Creating a database cluster consists of creating the directories in which the database date will live . . ."  The man page goes on to explain how to get around permission problems, etc. in this process.  Note to self:  read the man page . . . carefully.

2) What I also learned is that you can make as many database clusters as you want, all with different data directories.  postgres is the superuser name you see in the documentation and /var/postgresql/data is the directory, but, as demonstrated above in the output, you could put your data in a folder called data4.  If you gave a different name at the -U switch in the initdb command, the superuser name would be different too.  Or you could have more than one cluster with postgres named superusers but with different passwords.

All that said, one cluster per physical box and the conventional names are plenty for me - I'm just trying to get used to the Postgresql environment and get started.

At this point I need to start up the Postgresql daemon.  In the package install above, the output mentions an rc script /etc/rc.d/postgresql.  This is run by root - below is a demo of using it manually with su (instead of using it as part of an rc startup sequence at boot):

$ su
Password:

# /etc/rc.d/postgresql start
postgresql(ok)
# pgrep postgres
6960
10175
4748
29053
32758
26201
# /etc/rc.d/postgresql stop                                                   
postgresql(ok)
# pgrep postgres


All I did there was start the Postgresql daemon with the installed rc script, check to see that it's associated processes are running, then stop the daemon with the same script.

Me being me, I can't leave good enough alone.  I want the control of starting and stopping the daemon when I decide to (I am running this on a laptop).  As I understand it, pg_ctl is a wrapper program provided with the Postgresql install for even more low level commands and functionality.  I use pg_ctl to run the daemon and start it with the _postgresql user account:

$ su
Password:
 

# su - _postgresql
$ pg_ctl -D /var/postgresql/data -l firstlog start
server starting
$ exit
# exit
$


I asked pg_ctl to make a specific log file for this session (firstlog - this will go in directory /var/postgresql/).  The logs are human readable and I wanted to study them later to see what's going on (there's all kinds of stuff in there about autovacuum and what not - sorry, we're not covering that in this blog post - but I'll have it available later).

Shutting down (stopping) the daemon is pretty simple with pg_ctl - just a few more keystrokes than if I had done it from root with the rc script:

$ su
Password:
# su - _postgresql
$ pg_ctl -D /var/postgresql/data stop
waiting for server to shut down.... done
server stopped
$ exit
# exit
$ whoami
carl
$
 

Great - so I'm good for getting the daemon going when I want it and for designating my own specific log files per session.  Now to create a user and get to work:

(with daemon running):

$ psql -U postgres
Password for user postgres:
psql (9.4.1)
Type "help" for help.

postgres=# CREATE ROLE carl SUPERUSER;
CREATE ROLE
postgres=# ALTER ROLE carl PASSWORD 'xxxxxxxx'
ALTER ROLE
postgres=# ALTER USER carl PASSWORD 'xxxxxxxx' LOGIN;
ALTER ROLE
postgres=# \q

$

I created a user/role carl with SUPERUSER capabilities within this instance of Postgresql.  It's a bit ugly and I'm not sure I've done this correctly or the easiest way.  Also, and of importance, I have given Postgresql user carl (not OpenBSD user carl) all permissions on everything.  Really, carl only needs permissions to work on the database he's working on.  Josh Drake (@linuxhiker on twitter) pointed this out to me.  I am grateful for this.  He is right.  I am lazy.

Now to create my database.  I got into model trains around Christmas of 2015 and went crazy collecting stuff and setting up a layout.  I needed to somehow keep track of all the cars before it all got too unwieldy.

$ psql postgres carl
Password for user carl:
psql (9.4.1)
Type "help" for help.

postgres=# CREATE DATABASE hotrains;
CREATE DATABASE
postgres=# \q



The command line entry to start psql is something I'm a bit fuzzy on - postgres isn't, to the best of my knowledge, a database per se, but a means of connecting to psql when you don't want to designate a default database ot work on.

I'm not going to post the full database code for the sake of brevity - it's only 11 tables but that's a bit much for a blog post.  Instead I'll post a graphic schema I made and talk to it a little bit before posting one related SQL code snippet.

Disclaimer:  I'm not a designer.  This schema diagram I did with Dia, a fairly lightweight Linux/UNIX desktop tool for flowcharts and stuff.  I've never met a color palette or font choice I could simply let be.  Asking me to do a flowchart with a lot of leeway on design is like leaving a two year old home alone with a Crayola 64 pack of crayons and the 300 year old family Bible - it can't end well.

All that said, I find schema diagrams helpful for showing relationships between tables and having an ugly one is better than none at all.  I've embedded an svg version of it below; hopefully it shows up on the planetpython feed:




The focus of my crude toy database design was the use of foreign keys to maintain consistency in naming things I want to track:  rail name for example.  I went with "Santa Fe" where I could have went with (and probably should have) "ATSF."  It doesn't matter as long as it's consistent and I know what it means.

Years ago I was called in to do some work on a blasting database at the mine.  There weren't any constraints on the entry of names of blasting materials, but what could go wrong?  There were only three or four products with four digit designators and "None."  Well . . . it was a mess.  I didn't want to take any chances on having a situation like that again, even, or especially, if I was doing all of the data entry.  Foreign keys it was!

Here's a quick dump of the code I used to create the validsidenumbers table.  The idea is to make sure the rail line or company name is consistent in all the side number records (yes, I did actually purchase some identical rolling stock with the exact same side numbers - it's a long story):

hotrains=# CREATE TABLE validsidenumbers (
railnamex        varchar(50) REFERENCES validraillines (namex),
sidenumber       varchar(50),
comments         text,
PRIMARY KEY (railnamex, sidenumber)
);
CREATE TABLE
hotrains=#


That REFERENCES keyword sees to it that I won't enter anything typo'd or goofy into that railnamex column.

Next post is a Python one about pulling storing images of the train cars in the database and displaying them from within psql.

Thanks for stopping by.