Friday, December 11, 2015

Improved Storing and Displaying Images in Postgresql - bytea

Last post I brute forced the storage of binary image (jpeg) data as text in a Postgresql database, and accordingly brute forced the data's display in the Unix image viewer feh from output from a psql query.  It was hackish and I received some negative, but good constructive criticism on how to improve it:

1) use Python's base64 module instead of the binascii one.

2) use bytea as a storage type in Postgresql instead of text.

Marius Gedminus made the base64.b64encode suggestion for text.  It does make for a little less storage space.  Ultimately we won't go with this solution because we want to go with bytea, the Postgresql data type intended for this type of data.  But for completeness, here is what a base64.b64encode text solution would look like:

$ python3.5
Python 3.5.0 (default, Oct 23 2015, 21:23:18)
[GCC 4.2.1 20070719 ] on openbsd5
Type "help", "copyright", "credits" or "license" for more information.)
>>> import base64
>>> f = open('prrrailwhaletankcar.jpg', 'rb')
>>> bindata = f.read()
>>> f.close()
>>> b64data = str(base64.b64encode(bindata))
>>> # Converted data to string for write with csv file
>>> # to database table text field.
>>> # The string representation of BASE 64 includes the
>>> # letter b and single quotes.
>>> b64data[:10]
"b'/9j/4AAQ"
>>> b64data[-10:]
"AVAH/2Q=='"
>>> b64data[1]
"'"
>>> b64data[-1]
"'"
>>> # Isolate the BASE 64 digits with the quotes included.
>>> substrx = b64data[1:]
>>> picdata = base64.b64decode(substrx)
>>> f = open('test.jpg', 'wb')
>>> f.write(picdata)
187810
>>> f.close()
>>> len(substrx)
250418
>>> # BASE 64 string is 1 1/3 times as big as the
>>> # binary data it represents.
>>> _/187810
1.3333581811405144
>>> # Taking off the quote marks doesn't inhibit the
>>> # decoding of the BASE64 string at all - probably
>>> # best to go with this less is more approach.
subsubstrx = substrx[1:-1]
>>> picdata = base64.b64decode(subsubstrx)
>>> f = open('test2.jpg', 'wb')
>>> f.write(picdata)
187810
>>> f.close()
>>> len(picdata)
187810
>>> # BASE64 string ever so slightly smaller without
>>> # the quote marks (2 chars).
>>> len(subsubstrx)
250416
>>> _/187810
1.3333475320802939

>>> # Works in both cases.
>>> os.system('feh --geometry 400x300+200+200 test.jpg')

0
>>> os.system('feh --geometry 400x300+200+200 test2.jpg')

0
>>>


The results for both commands in the last lines (show picture with feh) look the same:


Storing the BASE 64 string in a Postgresql text column is the same as storing the hex one like I did in the last post.  The main thing to look out for is the proper stripping of the Python generated string for extra characters - single quotes are OK as long as they are matched on either end of the string.  As I mentioned in the code comments above, knowing what I know now, I would strip them out too even prior to storing the string in a database.

On to the Postgresql bytea storage part of the post.  Someone I respect asked me on Facebook, "Why didn't you just use bytea (for storage)?"  I had to sheepishly own up to just not being used to working with binary data (as opposed to strings) so I went with what I knew.  Shame drove me to at least attempt to do things the right way - binary storage for binary data, in this case a jpeg image.

Postgresql 9.4 uses a hex based representation (hex format) for the bytea data type by default.  It is possible to mess this up - it is covered in the doc but I didn't read it carefully enough:

If you preface your hexadecimal string with \x (single backslash) you will end up with an octal representation of your binary data (digits 0 through 7).  \\x prior to the hexadecimal string will give you what you, or at least I want, hexidecimal representation of your binary data on output.  The SQL string I used for processing my string data (already in the database from my work on the last blog post):




/* Postgresql SQL code */
CAST('\\x' || <hexadecimal string> AS bytea)

The || operator is for concatenation of strings (this is probably obvious to Postgresql and other database distro users but MSSQL uses a + symbol so it was a little new to me).

To deal with transitioning all my text picture columns to bytea I did the following:

1) create a new set of identical tables to the ones I had in the same database with new relations identical to the old ones but with the new set of tables.

2) fill the new tables in with the new data that has all the former text columns for binary as bytea.

3) delete the old tables once the new ones are filled in.

4) rename the new tables to match the names of the old ones (how I wanted the database schema to look in the first place).



Postgresql is different than MSSQL in that the database is more its own autonomous entity that needs to be connected to other databases by some introduced mechanism.  In MSSQL, databases on the same server can reference each other in queries by default.  I started looking into the Postgresql fdw (foreign data wrapper) plugin, then realized I could do this more easily with the path I took above.

It's not necessary to post all the SQL code.  I used a psql variable in my SQL for the hexadecimal data predicate to make sure I got it right each time.  From inside psql I executed the SQL files with the \i metacommand.  Here is a snippet with the variable.

/* Postgresql SQL code to be used with

   the Postgresql psql interpreter */

/* Need this for bytea conversion
   from hex string */
\set byteaidstr '\\x'
 

INSERT INTO locomotiveprototypes2
    SELECT keyx,
           namex,
           railnamex,
           paintscheme,
           photourl,
           comments,
           CAST(:'byteaidstr' || picture AS bytea)
    FROM locomotiveprototypes;


The variable thing in psql takes a little getting used to but the Postgresql documentation is good about explaining when and how to use the single quote marks and where to put them.  It worked out.

The most important part:  getting the picture to show up from a psql metacommand through the use of a python script.  Here is my modified script similar to the one in my last post:


#!/usr/local/bin/python3.5

"""
Processing of image coming out
of Postgresql query as a stream.

Deals with bytea column string
output from psql.
"""

import base64
import sys
import subprocess

DECODED = 'decoded'

SIZEMSG = '\nsize of {0:s} output = {1:d}\n'
SIZERATIOMSG = '\nsize of {0:s} output/size of binary output = {1:05.5f}\n'

# Want to avoid '\\x' in query output.
STARTINDEX = 3

FEHCMD = ['feh', '--geometry', '400x300+200+200', '-']

BYTEAFMT = 'bytea hex format'

# 2 variables track changes in size of
#     hex output from query in psql.
sizex = 0
lenxbin = 0

# Feeding to script straight from
# psql \copy metacommand.
inputx = sys.stdin.buffer.read()
sizex = len(inputx)

# print's are mainly for flagging when something goes wrong.
#     aka debugging
print(inputx[:10])
print(inputx[STARTINDEX])
print(inputx[-10:])

# -1 index in slice chops off the return character '\n'
# Need casefold=True to deal with lower case from Postgresql.
binx = base64.b16decode(inputx[STARTINDEX:-1], casefold=True)
lenxbin = len(binx)

# print's highlight size relationship between
#     hex representation and actual binary data.
print(SIZEMSG.format(BYTEAFMT , sizex))
print(SIZEMSG.format(DECODED, lenxbin))
print(SIZERATIOMSG.format(BYTEAFMT, sizex/lenxbin))

# Pops up picture on screen.
subprocess.run(FEHCMD, input=binx)

print('\nDone\n')


An important change I made from last time is fixing the call to the image viewer feh to eliminate all that hacky intermediate writing of a jpeg file that took forever (in computer time).  It turns out feh accepts binary input from a pipe or stdin just fine - I just needed to read the man page more thoroughly.

Now to see if this works:

$ psql hotrains carl
Password for user carl:
psql (9.4.4)
Type "help" for help.

hotrains=# \copy (SELECT picture FROM locomotiveprototypes WHERE keyx = 3) to program 'imageshow.py'
COPY 1
b'\\\\xffd8ffe'
102
b'8a000ffd9\n'

size of bytea hex format output = 1081720


size of decoded output = 540858


size of bytea hex format output/size of binary output = 2.00001



And we're good to go.

Thanks for stopping by.