
DAG Hamilton Graph Presented as SVG in Blogger

2024-09-27T12:30:00.000-07:00

Through the kindness of the DAG Hamilton project team, I was able to secure an official svg version of the DAG Hamilton logo. It looks significantly better than the one I had generated with an online image to svg converter, and it is much smaller and easier to work with (4 kilobytes versus 200 kilobytes). The DAG Hamilton graphviz graph now shows up in Blogger; it is unlikely to show up on the Planet Python feed. Blogger is not liking the code and svg I have included (complaints of malformed html). In the interest of preserving the rendering of the graph(s), I am constraining the text here to a few paragraphs.

The first graph has the code provided. This graph is from a previous post.

The second graph represents the DAG Hamilton workflow for the production of the first graph. This is in keeping with the "Eat your own dogfood" mantra. I happen to like the DAG Hamilton dogfood as I've mentioned in previous posts. It allows me to visualize my workflows and track complexity and areas for improvement in the code.

The third one I did with a scaled down version of the code presented (no logos). I hand pasted the DAG Hamilton official logo into the third one. It is not subtle (the logo is huge), but it provides an idea of what one can do creatively with the logo or any svg element. Also, it shows the DAG Hamilton workflow for a graph.
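
The hand pasting described above amounts to replacing one placeholder line with the lines of the logo. Here is a minimal sketch of that splice, in the spirit of svg_ready_doc in the listing further down. The name splice_lines and the sample strings are made up for illustration.

```python
def splice_lines(doc_lines, placeholder_index, new_lines):
    """Return a new list with the line at placeholder_index replaced by new_lines."""
    return (doc_lines[:placeholder_index] +
            new_lines +
            doc_lines[placeholder_index + 1:])

# Made-up document with two placeholder lines.
doc = ['<svg>\n', 'IMAGE_A\n', '<text>node</text>\n', 'IMAGE_B\n', '</svg>\n']
logo_a = ['<path d="a1"/>\n', '<path d="a2"/>\n']
logo_b = ['<path d="b1"/>\n']

# Work from the highest index down so the lower indices stay valid
# after each splice changes the length of the list.
result = doc
for index, logo in sorted([(1, logo_a), (3, logo_b)], reverse=True):
    result = splice_lines(result, index, logo)

print(''.join(result))
```

Doing the later placeholder first matters because each splice shifts everything after it. svg_ready_doc appears to rely on the same ordering: it inserts the company logo (the later index) before the Hamilton logo.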

All the code is a work in progress. Ideally I would like to keep reducing this to the simplest svg implementation possible to get it to show up or "work." Realistically, I'm afraid to sneeze for fear Blogger will protest. For now, I'm leaving good enough alone. Links and thoughts on svg will have to wait for another post. There is at least one python library out there (orsinium-labs/svg.py) that is way more elegant in its treatment of the medium than my rough regular expressions / text processing.

Thanks for stopping by.

run.py code


"""
Hamilton wrapper.
"""

# run.py

import sys

import pprint

from hamilton import driver

import dag_hamilton_to_blogger as dhtb

dr = driver.Builder().with_modules(dhtb).build()

dr.display_all_functions('dhtb.svg',
                         deduplicate_inputs=True,
                         keep_dot=True,
                         orient='BR')

results = dr.execute(['defluffed_lines',
                      'scale_and_translation',
                      'logo_positions',
                      'captured_values',
                      'scaled_elements',
                      'translated_elements',
                      'hamilton_logo_data',
                      'scale_and_translation_hamilton_logo',
                      'fauxcompany_logo_data',
                      'scale_and_translation_fauxcompany_logo',
                      'svg_ready_doc',
                      'written_svg'],
                      inputs={'svg_file':'web_scraping_functions_highlighted.svg',
                              'outputfile':'test_output.svg',
                              'hamiltonlogofile':'hamilton_official_stripped.svg',
                              'hamiltonlogo_coords':{'min_x':-0.001,
                                                     'max_x':4353.846,
                                                     'min_y':-0.0006,
                                                     'max_y':4177.257},
                              'fauxcompanylogofile':'fauxcompanylogo_stripped_down.svg',
                              'fauxcompanylogo_coords':{'min_x':11.542786063261742,
                                                        'max_x':705.10684,
                                                        'min_y':4.9643821,
                                                        'max_y':74.47416391682819}})
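
The list of names passed to dr.execute works because DAG Hamilton treats each function as a node named after the function, and uses the function's parameter names to identify the nodes or inputs it depends on. The sketch below is a toy resolver written only to illustrate that naming convention. It is not Hamilton's implementation, and execute, raw_lines, line_count, and summary are hypothetical names.

```python
import inspect

def raw_lines(svg_text: str) -> list:
    """Node: split the input text into lines."""
    return svg_text.splitlines()

def line_count(raw_lines: list) -> int:
    """Node: depends on the raw_lines node through its parameter name."""
    return len(raw_lines)

def summary(line_count: int, label: str) -> str:
    """Node: depends on the line_count node and the label input."""
    return '{0:s}: {1:d} lines'.format(label, line_count)

def execute(functions, targets, inputs):
    """Resolve each target by calling functions whose names match parameters."""
    by_name = {fcn.__name__: fcn for fcn in functions}
    # Inputs seed the cache; computed nodes are added as they resolve.
    cache = dict(inputs)

    def resolve(name):
        if name not in cache:
            fcn = by_name[name]
            kwargs = {param: resolve(param)
                      for param in inspect.signature(fcn).parameters}
            cache[name] = fcn(**kwargs)
        return cache[name]

    return {target: resolve(target) for target in targets}

results = execute([raw_lines, line_count, summary],
                  ['summary', 'line_count'],
                  {'svg_text': '<svg>\n<g>\n</g>\n</svg>', 'label': 'demo'})
print(results)
```

The cache means each node is computed once, however many downstream functions ask for it.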



Main DAG Hamilton functions (dag_hamilton_to_blogger.py)


# python 3.12

"""
Make DAG Hamilton graph show up in Blogger.
"""

import re
import sys
import pprint
import math
import copy

import reusedfunctions as rf

VIEWBOX_PAT = (r'[ ]viewBox[=]["][-]?[0-9]+[.]?[0-9]*[ ][-]?[0-9]+[.]?[0-9]*[ ]'
               r'([0-9]+[.]?[0-9]*)[ ]([0-9]+[.]?[0-9]*)')

# 5 coordinates.
POLYGON_PAT = (r'[<]polygon'
               r'.*([ ]points[=]["])([-]?[0-9]+[.]?[0-9]*)[,]'
                                  r'([-]?[0-9]+[.]?[0-9]*)[ ]'
                                  r'([-]?[0-9]+[.]?[0-9]*)[,]'
                                  r'([-]?[0-9]+[.]?[0-9]*)[ ]'
                                  r'([-]?[0-9]+[.]?[0-9]*)[,]'
                                  r'([-]?[0-9]+[.]?[0-9]*)[ ]'
                                  r'([-]?[0-9]+[.]?[0-9]*)[,]'
                                  r'([-]?[0-9]+[.]?[0-9]*)[ ]'
                                  r'([-]?[0-9]+[.]?[0-9]*)[,]'
                                  r'([-]?[0-9]+[.]?[0-9]*)["]')

# 4 coordinates instead of 5.
POLYGON_PAT_4 = (r'[<]polygon'
                 r'.*([ ]points[=]["])([-]?[0-9]+[.]?[0-9]*)[,]'
                                    r'([-]?[0-9]+[.]?[0-9]*)[ ]'
                                    r'([-]?[0-9]+[.]?[0-9]*)[,]'
                                    r'([-]?[0-9]+[.]?[0-9]*)[ ]'
                                    r'([-]?[0-9]+[.]?[0-9]*)[,]'
                                    r'([-]?[0-9]+[.]?[0-9]*)[ ]'
                                    r'([-]?[0-9]+[.]?[0-9]*)[,]'
                                    r'([-]?[0-9]+[.]?[0-9]*)["]')

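# The text (x, y), font-size and path patterns (TEXTPAT,
# TEXTPAT_FONTSIZE, PATHPAT) and the NEW_FIRST_LINE and
# NEW_SECOND_LINE templates are used below; their literal
# values are missing from this listing.

# Assumed value: graphviz writes embedded images as <image ...> elements.
IMAGE_FLAG = '<image'

# Assumed value, from the "about 600 wide" comment in scale_and_translation.
X_SIZE = 600

# 5 coords (rectangle); same layout as POLYGON_STR_4 with one more pair.
POLYGON_STR = (r' points="{0:.3f},{1:.3f} {2:.3f},{3:.3f} '
                       r'{4:.3f},{5:.3f} {6:.3f},{7:.3f} '
                       r'{8:.3f},{9:.3f}"/>')
# 4 coords (arrow head).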
POLYGON_STR_4 = (r' points="{0:.3f},{1:.3f} {2:.3f},{3:.3f} '
                         r'{4:.3f},{5:.3f} {6:.3f},{7:.3f}"/>')
PATH_START_STR = r' d="M{0:.3f},{1:.3f}C'
PATH_STR_SEGMENT = ' {0:.3f},{1:.3f}'
PATH_STR = r' {0:s}"/>'
TEXT_STR = r' x="{0:.3f}" y="{1:.3f}"'
TEXT_STR_FONT = r' font-size="{0:.3f}"'

HAMILTON_LOGO_DIMENSIONS_PAT = (r'.*width[=]["]([0-9]+[.]?[0-9]*)px["][ ]'
                                  r'height[=]["]([0-9]+[.]?[0-9]*)px["][>]')

FAUXCOMPANY_LOGO_DIMENSIONS_PAT = (r'[ ]width[=]["]([0-9]+[.]?[0-9]*)["][ ]'
                                      r'height[=]["]([0-9]+[.]?[0-9]*)["][ ][>]')

# The official Hamilton logo splits the path into multiple
# lines with the last one having the absolute location
# ("C") of a bezier curve.
HAMILTON_CHANGE_LINE_PAT = r'.*C[-]?[0-9]+[.]?[0-9]*'

HAMILTON_TRANSFORM_FMT = (' transform="scale({scale:f}) '
                           'translate({translate_x:f},{translate_y:f})" />')

# One line of paths in Inkscape generated file.
FAUXCOMPANY_CHANGE_LINE_PAT = r'.*d[=]["]m[ ]'

# Inkscape put the closing tag /> on the following line.
FAUXCOMPANY_TRANSFORM_FMT = (' transform="scale({scale:f}) '
                              'translate({translate_x:f},{translate_y:f})"')

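# * - get rid of first 6 lines.

# * - get rid of any line containing one of the
#     wrapper strings in bifflist (see defluffed_lines).

def defluffed_lines(svg_file:str) -> list:
    """
    Purge svg files of unnecessary lines (for
    further operations).

    Returns list of file string lines.
    """
    print('Defluffing graphviz svg output . . .')
    with open(svg_file, 'r') as f:
        lines = [linex for linex in f]
    lines = lines[6:]
    # Assumed entries; the original strings are missing from this listing.
    bifflist = ['<g id=', '<title>', '<!--', '</g>']
    retval = [linex for linex in lines
              if not any([linex.find(n) > -1
                          for n in bifflist])]
    return retval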

def scale_and_translation(defluffed_lines:list) -> dict:
    """
    Get relevant values for scaling and translation.

    defluffed_lines is a list of line strings.

    Returns dictionary.
    """
    retval = {}
    print('Getting scale and translation . . .')
    pat = re.compile(VIEWBOX_PAT)
    # Second line has everything.
    match = pat.match(defluffed_lines[1])
    retval['viewBox_x'], retval['viewBox_y'] = [float(x) for x in match.groups()]
    pat = re.compile(POLYGON_PAT)
    # Third line has this.
    match = pat.match(defluffed_lines[2])
    polycoords = [float(x) for x in match.groups()[1:]]
    x_coords = polycoords[0::2]
    y_coords = polycoords[1::2]
    retval['x_translation'] = -1.0 * min(x_coords)
    retval['y_translation'] = -1.0 * min(y_coords) 
    # Try to make it about 600 wide.
    scale = X_SIZE / max(x_coords)
    retval['x_translation_scaled'] = retval['x_translation'] * scale
    retval['y_translation_scaled'] = retval['y_translation'] * scale
    retval['scale'] = scale
    retval['width'] = math.ceil(scale * retval['viewBox_x'])
    retval['height'] = math.ceil(retval['y_translation_scaled'])
    return retval

def logo_positions(defluffed_lines:list) -> dict:
    """
    Get logo positions, size, etc.

    defluffed_lines is a list of svg file lines.

    Returns dictionary.
    """
    retval = {}
    target_indices = [x for x in range(len(defluffed_lines))
                      if defluffed_lines[x].find(IMAGE_FLAG) > -1]
    pat = re.compile(POLYGON_PAT)
    # Hamilton logo.
    match = pat.match(defluffed_lines[target_indices[0] - 1])
    polycoords = [float(x) for x in match.groups()[1:]]
    retval['hamilton_posit'] = polycoords
    # Company logo.
    match = pat.match(defluffed_lines[target_indices[1] - 1])
    polycoords = [float(x) for x in match.groups()[1:]]
    retval['company_posit'] = polycoords
    retval['target_indices'] = target_indices
    return retval

def captured_values(defluffed_lines:list,
                    logo_positions:dict) -> list:
    """
    Make list of dictionaries for each line
    in stripped down svg file.
    """
    # Idea is to get values to be scaled and 
    # translated and stop indices within string
    # for later processing.
    polygonpat = re.compile(POLYGON_PAT)
    # For polygon with only 4 coordinates instead of 5.
    polygonpat_4 = re.compile(POLYGON_PAT_4) 
    textpat = re.compile(TEXTPAT)
    # For font size.
    textpat_fontsize = re.compile(TEXTPAT_FONTSIZE)
    pathpat = re.compile(PATHPAT)
    retval = []
    for idx, linex in enumerate(defluffed_lines):
        if idx in logo_positions['target_indices']:
            retval.append({'type':'logo position'})
        else:
            if match := polygonpat.match(linex):
                newdict = {'type':'polygon',
                           'start':match.start(1),
                           'groups':[float(x) for x in match.groups()[1:]]}
                retval.append(newdict)
            elif match := polygonpat_4.match(linex):
                newdict = {'type':'polygon',
                           'start':match.start(1),
                           'groups':[float(x) for x in match.groups()[1:]]}
                retval.append(newdict)
            elif match := textpat.match(linex):
                newdict = {'type':'text',
                           'span':match.span(),
                           'start':match.start(1),
                           'groups':[float(x) for x in match.groups()[1:]]}
                match_fontsize = textpat_fontsize.match(linex)
                newdict['fontsize_start'] = match_fontsize.start(1)
                newdict['font-size'] = float(match_fontsize.groups()[1])
                newdict['fontsize_end'] = match_fontsize.span()[1]
                retval.append(newdict)
            elif match := pathpat.match(linex):
                span = match.span()
                newdict = {'type':'path',
                           'span':span,
                           'start':match.start(1),
                           'groups':[float(x) for x in match.groups()[1:]],
                           'tail':rf.parse_path(linex, span[1])}
                retval.append(newdict)
            else:
                retval.append({'type':'missed'})
    return retval

def scaled_elements(captured_values:list, scale_and_translation:dict) -> list:
    """
    Takes list of svg file line
    dictionaries and scales all
    the coordinates by a factor.

    Returns new list.
    """
    scale = scale_and_translation['scale']
    retval = []
    for linex in captured_values:
        if linex['type'] == 'missed':
            retval.append(linex)
        elif linex['type'] == 'logo position':
            retval.append(linex)
        elif 'groups' in linex:
            el = copy.deepcopy(linex)
            el['groups'] = [scale * x for x in el['groups']]
            # path
            if 'tail' in linex:
                el['tail'] = [[x * scale for x in n]
                              for n in el['tail']]
            if el['type'] == 'text':
                el['font-size'] = scale * el['font-size']
            retval.append(el)
    return retval

def translated_elements(scaled_elements:list, scale_and_translation:dict) -> list:
    """
    Takes list of svg file line
    dictionaries and translates all
    the coordinates by a distance
    
    scale_and_translation is a dictionary.

    Returns new list.
    """
    x_translation = scale_and_translation['x_translation_scaled']
    y_translation = scale_and_translation['y_translation_scaled']
    retval = []
    for linex in scaled_elements:
        el = copy.deepcopy(linex)
        if linex['type'] == 'missed':
            retval.append(el)
        elif linex['type'] == 'logo position':
            retval.append(linex)
        elif 'groups' in linex:
            el['groups'] = []
            for idx, num in enumerate(linex['groups']):
                if idx % 2 == 0:
                    el['groups'].append(num + x_translation)
                else:
                    el['groups'].append(num + y_translation)
            # path
            if 'tail' in linex:
                el['tail'] = []
                coord = []
                for coordx in linex['tail']:
                    el['tail'].append([coordx[0] + x_translation,
                                       coordx[1] + y_translation])
            retval.append(el)
    return retval

def hamilton_logo_data(hamiltonlogofile:str,
                       hamiltonlogo_coords:dict) -> dict:
    """
    Attempt to get relevant data for Hamilton logo svg file.

    Returns dictionary of information on svg.
    """
    retval = {}
    retval.update(hamiltonlogo_coords)
    with open(hamiltonlogofile, 'r') as f:
        originallines = [linex for linex in f]
    retval['originallines'] = originallines
    dimensionspat = re.compile(HAMILTON_LOGO_DIMENSIONS_PAT)
    # Second line.
    match = dimensionspat.match(originallines[1])
    # x, y
    retval['dimensions'] = [float(x) for x in match.groups()]
    return retval

def scale_and_translation_hamilton_logo(hamilton_logo_data:dict,
                                        logo_positions:dict,
                                        defluffed_lines:list,
                                        scale_and_translation:dict) -> dict:
    """
    Returns dictionary with scale and x and y
    translations for Hamilton logo.
    """
    retval = {}
    # index just before image position has polygon coords (rectangle).
    logo_poly_idx = logo_positions['target_indices'][0] - 1
    target_line = defluffed_lines[logo_poly_idx]
    pat = re.compile(POLYGON_PAT)
    match = pat.match(target_line)
    polycoords = [float(x) for x in match.groups()[1:]]
    x_coords = polycoords[0::2]
    y_coords = polycoords[1::2]
    # Need to scale and translate these.
    y_size = max(y_coords) - min(y_coords)
    y_size *= scale_and_translation['scale']
    scale = y_size / hamilton_logo_data['dimensions'][1]
    print('hamilton logo scale = {0:f}'.format(scale))
    retval['scale'] = scale
    retval['x_posit'] = (min(x_coords) *
                         scale_and_translation['scale'] +
                         scale_and_translation['x_translation_scaled'])
    retval['y_posit'] = (min(y_coords) *
                         scale_and_translation['scale'] +
                         scale_and_translation['y_translation_scaled'])
    # Get translation from upper left corner of logo
    retval['translate_x'] = retval['x_posit'] - hamilton_logo_data['min_x']
    retval['translate_y'] = retval['y_posit'] - hamilton_logo_data['min_y']
    # Translation of the path element remains in the original coordinate
    # space - not impacted by the scale transformation on the same svg line.
    retval['translate_x'] /= scale
    retval['translate_y'] /= scale
    return retval

def fauxcompany_logo_data(fauxcompanylogofile:str,
                          fauxcompanylogo_coords:dict) -> dict:
    """
    Attempt to get relevant data for the contrived
    company logo svg file.

    Returns dictionary of information on svg.
    """
    retval = {}
    retval.update(fauxcompanylogo_coords)
    with open(fauxcompanylogofile, 'r') as f:
        originallines = [linex for linex in f]
    retval['originallines'] = originallines
    dimensionspat = re.compile(FAUXCOMPANY_LOGO_DIMENSIONS_PAT)
    # Second line.
    match = dimensionspat.match(originallines[1])
    # x, y
    retval['dimensions'] = [float(x) for x in match.groups()]
    return retval

def scale_and_translation_fauxcompany_logo(fauxcompany_logo_data:dict,
                                           logo_positions:dict,
                                           defluffed_lines:list,
                                           scale_and_translation:dict) -> dict:
    """
    Returns dictionary with scale and x and y
    translations for contrived company logo.
    """
    retval = {}
    # index just before image position has polygon coords (rectangle).
    logo_poly_idx = logo_positions['target_indices'][1] - 1
    target_line = defluffed_lines[logo_poly_idx]
    pat = re.compile(POLYGON_PAT)
    match = pat.match(target_line)
    polycoords = [float(x) for x in match.groups()[1:]]
    x_coords = polycoords[0::2]
    y_coords = polycoords[1::2]
    # Need to scale and translate these.
    y_size = max(y_coords) - min(y_coords)
    y_size *= scale_and_translation['scale']
    scale = y_size / fauxcompany_logo_data['dimensions'][1]
    print('fauxcompany logo scale = {0:f}'.format(scale))
    retval['scale'] = scale
    # hack
    retval['scale'] /= 1.08
    retval['x_posit'] = (min(x_coords) *
                         scale_and_translation['scale'] +
                         scale_and_translation['x_translation_scaled'])
    retval['y_posit'] = (min(y_coords) *
                         scale_and_translation['scale'] +
                         scale_and_translation['y_translation_scaled'])
    # Get translation from upper left corner of logo
    retval['translate_x'] = retval['x_posit'] - fauxcompany_logo_data['min_x']
    retval['translate_y'] = retval['y_posit'] - fauxcompany_logo_data['min_y']
    # hack
    retval['translate_x'] += 24
    retval['translate_y'] += 17
    # Translation of the path element remains in the original coordinate
    # space - not impacted by the scale transformation on the same svg line.
    retval['translate_x'] /= scale
    retval['translate_y'] /= scale
    return retval

def svg_ready_hamilton_logo(scale_and_translation_hamilton_logo:dict,
                            hamilton_logo_data:dict) -> list:
    """
    Get list of strings for Hamilton logo svg
    to insert into final svg.
    """
    pat = re.compile(HAMILTON_CHANGE_LINE_PAT)
    retval = []
    # Adobe generated file has initial svg tag split
    # into two lines.
    for linex in hamilton_logo_data['originallines'][2:-1]:
        if pat.match(linex):
            # do thing.
            retval.append(linex[:-3] + 
                          HAMILTON_TRANSFORM_FMT.format(**scale_and_translation_hamilton_logo))
        else:
            retval.append(linex)
    return retval

def svg_ready_fauxcompany_logo(scale_and_translation_fauxcompany_logo:dict,
                               fauxcompany_logo_data:dict) -> list:
    """
    Get list of strings for fauxcompany logo svg
    to insert into final svg.
    """
    pat = re.compile(FAUXCOMPANY_CHANGE_LINE_PAT)
    retval = []
    # Inkscape generated file has initial svg tag split
    # into two lines.
    for linex in fauxcompany_logo_data['originallines'][2:-1]:
        if pat.match(linex):
            # do thing.
            retval.append(linex[:] + 
              FAUXCOMPANY_TRANSFORM_FMT.format(**scale_and_translation_fauxcompany_logo))
        else:
            retval.append(linex)
    return retval

def svg_ready_doc(translated_elements:list,
                  scale_and_translation:dict,
                  logo_positions:dict,
                  defluffed_lines:list,
                  svg_ready_hamilton_logo:list,
                  svg_ready_fauxcompany_logo:list) -> list:
    """
    Returns list of string lines of svg file
    ready to write.

    Inputs: list of dictionaries, dictionary, dictionary, list, list.
    """
    retval = []
    retval.append(NEW_FIRST_LINE.format(scale_and_translation['width'],
                                        # Add padding at bottom. CBT 2024-09-20
                                        scale_and_translation['height'] + 10))
    retval.append(NEW_SECOND_LINE.format(scale_and_translation['viewBox_x'] *
                                         scale_and_translation['scale'],
                                         scale_and_translation['y_translation_scaled']))
    for idx, text_values in enumerate(zip(defluffed_lines, translated_elements)):
        # First two lines already dealt with.
        if idx in (0, 1):
            continue
        # Skip images.
        if idx in logo_positions['target_indices']:
            # Dummy empty string. CBT 2024-09-24
            retval.append('')
        text = text_values[0]
        values = text_values[1]
        linestr = ''
        if values['type'] == 'polygon':
            linestr += text[:values['start']]
            # rectangle
            if len(values['groups']) == 10:
                linestr += POLYGON_STR.format(*values['groups'])
            # 4 coordinates (triangle).
            elif len(values['groups']) == 8:
                linestr += POLYGON_STR_4.format(*values['groups'])
            retval.append(linestr)
        elif values['type'] == 'text': 
            retval.append(text[:values['start']] +
                          TEXT_STR.format(*values['groups']) +
                          text[values['span'][1]:values['fontsize_start']] +
                          TEXT_STR_FONT.format(values['font-size']) +
                          text[values['fontsize_end']:])
        elif values['type'] == 'path': 
            linestr = text[:values['start']]
            linestr += PATH_START_STR.format(*values['groups'])
            pathstr = ''
            for coordx in values['tail']:
                pathstr += PATH_STR_SEGMENT.format(*coordx)
            linestr += PATH_STR.format(pathstr[:-1])
            retval.append(linestr)
    retval.append('</svg>')
    # Deal with logos here.
    # Company logo first.
    # "erase" box.
    target_index = logo_positions['target_indices'][1] - 1
    retval[target_index] = retval[target_index].replace('stroke="black"','stroke="none"')
    retval = (retval[:logo_positions['target_indices'][1]] +
              svg_ready_fauxcompany_logo +
              retval[logo_positions['target_indices'][1] + 1:])
    # Hamilton logo.
    # "erase" box.
    target_index = logo_positions['target_indices'][0] - 1
    retval[target_index] = retval[target_index].replace('stroke="black"','stroke="none"')
    retval = (retval[:logo_positions['target_indices'][0]] +
              svg_ready_hamilton_logo +
              retval[logo_positions['target_indices'][0] + 1:])
    return retval

def written_svg(svg_ready_doc:list,
                outputfile:str) -> str:
    """
    Write out doc.
    """
    with open(outputfile, 'w') as f:
        for linex in svg_ready_doc:
            f.write(linex)
    return outputfile
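
The two scale_and_translation_*_logo functions above write a transform of the form scale(s) translate(tx,ty). In SVG, transforms listed in one attribute compose so that the rightmost one applies to the coordinates first, which is why the translation is expressed in the logo's unscaled units. Here is a small sketch of that arithmetic. The helper names and the numbers are made up for illustration.

```python
def apply_transform(point, scale, translate):
    """Apply transform="scale(s) translate(tx,ty)": translate first, then scale."""
    x, y = point
    tx, ty = translate
    return (scale * (x + tx), scale * (y + ty))

def translate_for_target(corner, target, scale):
    """Translation (unscaled units) that lands corner on target after scaling."""
    cx, cy = corner
    px, py = target
    return (px / scale - cx, py / scale - cy)

# Made-up numbers: upper left corner of a logo in its own coordinates,
# and where that corner should land in the final drawing.
corner = (11.5, 5.0)
target = (120.0, 40.0)
scale = 0.25

translate = translate_for_target(corner, target, scale)
landed = apply_transform(corner, scale, translate)
print(translate, landed)
```

Solving s * (corner + t) = target for t gives target / s - corner. The listing divides (target - corner) by s instead; the two agree when the corner is at or near zero, as with the Hamilton logo coordinates passed in run.py.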



reusedfunctions.py


# python 3.12

"""
Auxiliary module to Hamilton svg script.
"""

# reusedfunctions.py

import re

import pprint

# rest of points
COORDPATHPAT = (r'([-]?[0-9]+[.]?[0-9]*)[,]'
                r'([-]?[0-9]+[.]?[0-9]*)[ ]')

COORDPATH_ENDPAT = (r'([-]?[0-9]+[.]?[0-9]*)[,]'
                    r'([-]?[0-9]+[.]?[0-9]*)["][/][>]')

# For follow on points.
coordpathpat = re.compile(COORDPATHPAT)
# For final follow on point.
coordpath_endpath = re.compile(COORDPATH_ENDPAT)

def parse_path(pathstr, startposit):
    """
    Parse remainder of svg path element.

    pathstr string input format is

        261.26,-362.22 245.25,-355.88 229.64,-349.7 etc

    startposit is an integer.

    Returns list with path coordinate lists (floats).
    """
    retval = []
    while match := coordpathpat.match(pathstr, startposit):
        span = match.span()
        retval.append([float(x) for x in match.groups()])
        startposit = span[1]
    match = coordpath_endpath.match(pathstr[startposit:])
    retval.append([float(x) for x in match.groups()])
    return retval
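
For comparison, the same comma separated pairs can be pulled out in one pass with findall. This is only a sketch of an alternative, not a replacement for parse_path above. The name parse_pairs and the sample line are hypothetical, and the sample reuses the coordinates from the docstring.

```python
import re

# Same number pattern as the listing above.
NUMBER = r'[-]?[0-9]+[.]?[0-9]*'
PAIR_PAT = re.compile(r'({0})[,]({0})'.format(NUMBER))

def parse_pairs(pathstr, startposit=0):
    """Return [x, y] float lists for every x,y pair from startposit onward."""
    return [[float(x), float(y)]
            for x, y in PAIR_PAT.findall(pathstr, startposit)]

# Hypothetical graphviz style path line.
line = ' d="M261.26,-362.22C245.25,-355.88 229.64,-349.7"/>'
# Start just after the C so the M coordinates are skipped.
tail = parse_pairs(line, line.index('C') + 1)
print(tail)
```

The tradeoff is that findall silently skips anything it does not recognize, where the loop in parse_path stops at the first unexpected character.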



Scalable Vector Graphics Followup
2024-08-31T06:47:00.000-07:00
A quick follow up to the last post:

1) I spelled scalable wrong in the title; hopefully it's fixed.
2) The svg partial logo that rendered in Blogger did not render on the Planet Python feed.
3) The logo svg is too big for Blogger. I need to find a way to make a smaller (file size) one.

One more try with svg in Blogger. The png DAG Hamilton logo first:



And one more attempt:









Scalable Vector Graphics (svg) - Decomposing and Scaling Elements for Blogger
2024-08-30T23:35:00.000-07:00
Last post I lamented the failure of any of the svg graphs I had for DAG Hamilton workflows to render in Blogger. I set out to rectify this and have had some initial success.



My first attempt to get svg to show up in Blogger has the DAG Hamilton logo as its subject. My approach was to bring the svg elements down to the most basic primitives I could manage, then scale and translate their individual coordinates to bring them into the view.
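
The scale then translate idea can be sketched on a handful of points. This is an illustration with made-up coordinates and hypothetical function names, not the code used for the logo.

```python
def scale_points(points, scale):
    """Multiply every coordinate by the same factor."""
    return [(x * scale, y * scale) for x, y in points]

def translate_points(points, dx, dy):
    """Shift every point by (dx, dy)."""
    return [(x + dx, y + dy) for x, y in points]

def fit_to_width(points, target_width):
    """Scale points to target_width, then move them so the minimum x, y is 0, 0."""
    xs = [x for x, _ in points]
    scale = target_width / (max(xs) - min(xs))
    scaled = scale_points(points, scale)
    dx = -min(x for x, _ in scaled)
    dy = -min(y for _, y in scaled)
    return translate_points(scaled, dx, dy)

# Made-up rectangle that starts outside the visible area.
points = [(-100.0, -50.0), (300.0, -50.0), (300.0, 150.0), (-100.0, 150.0)]
fitted = fit_to_width(points, 200.0)
print(fitted)
```

Scaling first and translating second keeps the translation in final drawing units, so the offset can be read directly off the scaled minimums.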



Fortunately, I was able to find an example from someone who had successfully rendered svg in Blogger. The subject blog post is eleven years old. It appears svg never totally caught on for some platforms. Nonetheless, I used this as a template, and, after some initial success, managed to render the Hamilton logo.
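
When working from a template like this, one cheap check before pasting into the post editor is whether the assembled svg parses as XML at all. The sketch below uses the standard library parser. The wrapper constants and function names are hypothetical, and passing this check says nothing about what Blogger itself will accept.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper tags for an inline svg.
SVG_OPEN = '<svg xmlns="http://www.w3.org/2000/svg" width="500px" height="500px">'
SVG_CLOSE = '</svg>'

def wrap_paths(path_lines):
    """Join path element strings inside one svg element."""
    return '\n'.join([SVG_OPEN] + path_lines + [SVG_CLOSE])

def is_well_formed(markup):
    """Return True if markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True

# Made-up path elements; the second one is missing its closing slash.
good = wrap_paths(['<path d="M0 0 C10 0 10 10 0 10 Z" fill="#C3368C" />'])
bad = wrap_paths(['<path d="M0 0 C10 0 10 10 0 10 Z" fill="#C3368C">'])
print(is_well_formed(good), is_well_formed(bad))
```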



The post will step through the seven individual path components of the Hamilton logo and show the code used to transform the coordinates so that the svg elements render at an appropriate location and size. The individual path elements that make up the logo consist of a large number of bezier curves with three coordinate pairs each. The manner in which the curves are listed is very specific to the platform that created them. Unfortunately, I cannot recall which online png to svg converter I used to create the svg. Portable Inkscape kept crashing on the large png file, so I brute forced the issue by using an online converter.
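
To make the format concrete: each path is a run of M (move to), C (bezier curve with three coordinate pairs), and Z (close) commands separated by spaces. A minimal tokenizing sketch follows. It takes a different approach from the regex stepping in the script at the end of the post, the names are hypothetical, and the second subpath in the sample uses made-up numbers.

```python
import re

NUM = r'-?[0-9]+(?:[.][0-9]+)?'
TOKEN_PAT = re.compile(r'[MCZ]|' + NUM)

def _flush(command, numbers, subpaths):
    """Store the numbers collected for one command."""
    if command == 'M':
        subpaths.append({'start': tuple(numbers), 'curves': []})
    elif command == 'C':
        subpaths[-1]['curves'].append(numbers)

def parse_d(dstring):
    """Split a path d string into subpaths of start point plus bezier curves."""
    subpaths = []
    command = None
    numbers = []
    for token in TOKEN_PAT.findall(dstring):
        if token in 'MCZ':
            # A new command letter ends the previous command.
            _flush(command, numbers, subpaths)
            command, numbers = token, []
        else:
            numbers.append(float(token))
    _flush(command, numbers, subpaths)
    return subpaths

sample = ('M0 0 C2.54601018 1.57157072 5.09846344 3.13131386 '
          '7.65625 4.68359375 Z M10 10 C11 11 12 12 13 13 Z ')
parsed = parse_d(sample)
print(parsed)
```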



The first element of the logo is basically a purple background for the whole logo. A gap or divot can be seen just left of center on the upper part. I'll cover the manual "fixing" of this further on.



Second path (orange)



Third path (deep purple)





Fourth path (yellow top point)

                                        

Fifth path (light orangey left leg were the star to face out from the screen)

                                      

Sixth path (medium purple little triangle in the center)

                                      

Seventh path (light orangey little triangle to the right of center)

                                    


Fixing that niggling divot on the top half of the logo (the thin subvertical line).


                                     


Final product. This is the one svg inline drawing I was able to show (content size limitations of Blogger?).








The full logo refuses to render (although I swear it did before). Well, at least there is an svg element showing up on the page (a purple star). <sigh> Another png . . .

                                    


The code. I'm not strong on HTML, but it was necessary to edit this post inline. As part of that exercise, I included the post outline generation as part of the script.


# python 3.12

# blog_post_make_outline.py

"""
Attempt to scale, translate, and inline svg
elements for display in Blogger.
"""

import re

import pprint

import sys

import copy

import xml.etree.ElementTree as ET

# REGEX patterns.

PATHPAT = r'[<]path[ ]d[=]["]'

MPAT = r'M([-]*[0-9]+[.]*[0-9]*)[ ]([-]*[0-9]+[.]*[0-9]*)[ ]'

          # first bezier curve coord
BEZPAT = (r'C([-]*[0-9]+[.]*[0-9]*)[ ]([-]*[0-9]+[.]*[0-9]*)[ ]'
          # second bezier curve coord
          r'([-]*[0-9]+[.]*[0-9]*)[ ]([-]*[0-9]+[.]*[0-9]*)[ ]'
          # third bezier curve coord
          r'([-]*[0-9]+[.]*[0-9]*)[ ]([-]*[0-9]+[.]*[0-9]*)[ ]')

ZPAT = r'Z[ ]'

FILLPAT = (r'["] fill[=]["]([#][A-F0-9][A-F0-9][A-F0-9]'
           r'[A-F0-9][A-F0-9][A-F0-9])["][ ]')

#                 transform="translate(4756.96875,109.12890625)" 
#                 transform="translate(5619,4112)"/>
TRANSFORMPAT = (r'transform[=]["]translate[(]'
                r'([-]*[0-9]+[.]*[0-9]*)[,]([-]*[0-9]+[.]*[0-9]*)[)]["]')

# Output formats/constants.

BEZFMT = ('C{0:.7f} {1:.7f} '
          '{2:.7f} {3:.7f} '
          '{4:.7f} {5:.7f} ')

PATHFMT_OPEN = '<path d="'

PATHFMT_1 = 'M{mstartx:.5f} {mstarty:.5f} {path:s} Z '

PATHFMT_2 = '" fill="{fill:s}" transform="translate({translatex:.7f},{translatey:.7f})"'

PATHFMT_CLOSE = ' />'

SVG_TAG_OPEN = ('<svg xmlns="http://www.w3.org/2000/svg" '
                'xmlns:xlink="http://www.w3.org/1999/xlink" '
                "width='500px' height='500px'>")

SVG_TAG_CLOSE = '</svg>'

def parse_path(pathstring):
    """
    Capture path elements in a dictionary.

    pathstring is the svg string for the path (one line).

    For a path comprised entirely of bezier curves
    in the format (all one line):

    <ns0:path d="M0 0 C2.54601018 1.57157072 5.09846344 3.13131386 7.65625 4.68359375 C39.179 . . .  0 0 Z M-690.96875 4007.87109375 C-707.702 . . .  Z " fill="#C3368C" transform="translate(4756.96875,109.12890625)" />

    Returns dictionary.
    """
    retval = {}
    patpath = re.compile(PATHPAT)
    match = patpath.match(pathstring)
    startindex = match.span()[1]
    mpat = re.compile(MPAT)
    # MPAT
    match = mpat.match(pathstring[startindex:])
    mpatgroups = match.groups()
    retval['mpatgroups'] = []
    retval['mpatgroups'].append(mpatgroups)
    startindex += match.span()[1]
    bezpat = re.compile(BEZPAT)
    zpat = re.compile(ZPAT)
    retval['paths'] = []
    while match:
        pathpoints = []
        # BEZPAT
        match = bezpat.match(pathstring[startindex:])
        while match:
            pathpoints.append(match.groups())
            startindex += match.span()[1]
            match = bezpat.match(pathstring[startindex:])
        retval['paths'].append(pathpoints)
        # ZPAT
        match = zpat.match(pathstring[startindex:])
        startindex += match.span()[1]
        # Then look for MPAT
        # MPAT
        match = mpat.match(pathstring[startindex:])
        # If MPAT not there, work on color and transform.
        if not match:
            continue
        startindex += match.span()[1]
        retval['mpatgroups'].append(match.groups())
    fillpat = re.compile(FILLPAT)
    match = fillpat.match(pathstring[startindex:])
    startindex += match.span()[1]
    print('adding fill . . .')
    fill = match.groups()[0]
    retval['fill'] = fill
    transformpat = re.compile(TRANSFORMPAT)
    match = transformpat.match(pathstring[startindex:])
    transform = match.groups()
    print('adding transform . . .')
    retval['transform'] = transform
    return retval

def parse_all_paths(svgfilepath):
    """
    Finds and parses all svg paths
    within an svg file (very format
    specific - bezier curves only).

    Returns list of dictionaries, one
    for each path line of the svg file.
    """
    # Do all paths in Hamilton logo.
    # Make list of dictionaries.
    patpath = re.compile(PATHPAT)
    with open(svgfilepath, 'r') as f:
        paths = []
        # Line one.
        next(f)
        # Line two.
        next(f)
        for linex in f:
            print(PATHPAT)
            print(linex[:30])
            match = patpath.match(linex)
            if not match:
                break
            paths.append(parse_path(linex))
    return paths

def work_paths(paths, fcn):
    """
    Apply an operation to all coordinates in
    the bezier curve paths represented in
    paths.

    Also covers translate and fill.

    paths is a list of dictionaries. Each
    dictionary represents one line of an
    svg file with a path made up of 
    bezier curves.
    """
    # Return value.
    newpaths = []
    # M - start of path segment.
    for pthx in paths:
        newmpatgroups = []
        for coordsx in pthx['mpatgroups']:
            newmpatgroups.append([fcn(x) for x in coordsx])
        newpaths.append({'mpatgroups':newmpatgroups})
    # to Z - end of path segment.
    # List of path dictionaries (paths).
    for pthx, newpath in zip(paths, newpaths):
        newpath['paths'] = []
        # Each M to Z path segment.
        for path in pthx['paths']:
            curvegroup = []
            for curve in path:
                newcurve = [fcn(x) for x in curve]
                curvegroup.append(newcurve)
            newpath['paths'].append(curvegroup)
    # transform and fill.
    for pthx, newpath in zip(paths, newpaths):
        newpath['transform'] = [fcn(x) for x in pthx['transform']]
        newpath['fill'] = pthx['fill']
    return newpaths

def translate_paths(paths, translation):
    """
    From a two tuple of x, y translation,
    adjust dictionary values for x, y 
    translation in each path in path
    list accordingly.

    Returns new list of path dictionaries.
    """
    # TRANSLATE
    # Keys: ['mpatgroups', 'paths', 'fill', 'transform']
    translated_paths = copy.deepcopy(paths)
    for pathx in translated_paths:
        pathx['transform'][0] += translation[0]
        pathx['transform'][1] += translation[1]
    return translated_paths

def get_path_strings(paths):
    """
    From a list of path dictionaries, 
    builds one line strings for insertion
    into svg file.

    Returns list
    """
    pathdict_2 = {'fill':None,
                'translatex':None,
                'translatey':None}
    path_segment_dict = {'mstartx':None,
                         'mstarty':None,
                         'path':None}
    pathstrings = []
    for pathx in paths:
        # Copy and initialize fill/translate dictionary.
        fill_translate = copy.deepcopy(pathdict_2)
        fill_translate['fill'] = pathx['fill']
        fill_translate['translatex'] = pathx['transform'][0]
        fill_translate['translatey'] = pathx['transform'][1]
        # Zip together M and path segments.
        path_segs = zip(pathx['mpatgroups'], pathx['paths'])
        # For each path segment.
        path_strings = []
        for M, path_seg in path_segs:
            seg_dict = copy.deepcopy(path_segment_dict)
            seg_dict['mstartx'] = M[0]
            seg_dict['mstarty'] = M[1]
            # Build path segment string.
            path_seg_strings = [BEZFMT.format(*coords) for coords in path_seg]
            path = ''.join(path_seg_strings)
            seg_dict['path'] = path
            # Make final segment string with M (PATHFMT_1)
            path_with_M = PATHFMT_1.format(**seg_dict)
            path_strings.append(path_with_M)
        path_all_together = ''.join(path_strings)
        # Tack on fill/translate at end and beginning with d path flag.
        # Add to pathstrings.
        pathstrings.append(PATHFMT_OPEN  +
                           path_all_together +
                           PATHFMT_2.format(**fill_translate) +
                           PATHFMT_CLOSE)
    return pathstrings  

paths = parse_all_paths('hamilton_logo_large.svg')
print('len(paths) = {0:d}'.format(len(paths)))
pprint.pprint([x for x in paths[0]])

paths = work_paths(paths, float)

scale = 0.035
scaleit = lambda x: scale * x

paths = work_paths(paths, scaleit)

# pprint.pprint(paths[0]['paths'][0])

paths = translate_paths(paths, (125, 0))

pathstrings = get_path_strings(paths)

# with open('test_paths.txt', 'w') as f:
#     for pathx in pathstrings:
#         print(pathx, file=f)
#         print('\n\n', file=f)

GAP_FIX = "<polygon points='265 90.25, 245 143.75, 268 144.1, 295 90.25' style='fill: black;' />"
GAP_FIX_PROPER_COLOR = "<polygon points='265 90.25, 245 143.75, 268 144.1, 295 90.25' style='fill: #C3368C;' />"

TEXT = '<p>{0:s}</p>'

CODE = """
<p>&nbsp;</p><pre style="background: rgb(238, 238, 238); border-bottom-color: initial; border-bottom-style: initial; border-image: initial; border-left-color: initial; border-left-style: initial; border-radius: 10px; border-right-color: initial; border-right-style: initial; border-top-color: rgb(221, 221, 221); border-top-style: solid; border-width: 5px 0px 0px; color: #444444; font-family: &quot;Courier New&quot;, Courier, monospace; font-stretch: inherit; font-variant-east-asian: inherit; font-variant-numeric: inherit; line-height: inherit; margin-bottom: 1.5em; margin-top: 0px; overflow-wrap: normal; overflow: auto; padding: 12px; vertical-align: baseline;"><span style="font-size: 13px;">{0:s}</span></pre>
"""

textlist = ['blah blah blah',
            'blah blah blah again',
            'blah blah blah a third time',
            'blah blah blah a fourth time',
            'blah blah blah a fifth time',
            'blah blah blah a sixth time',
            'blah blah blah a seventh time',
            'fix divit',
            'final product']

fixed_divit = copy.deepcopy(pathstrings)
fixed_divit.insert(1, GAP_FIX_PROPER_COLOR)

svglist = [pathstrings[0:1],
           pathstrings[0:2],
           pathstrings[0:3],
           pathstrings[0:4],
           pathstrings[0:5],
           pathstrings[0:6],
           pathstrings[0:],
           pathstrings[0:] + [GAP_FIX],
           fixed_divit]

with open('blogpost.html', 'w') as f:
    for blah, svgels in zip(textlist, svglist):
        print(TEXT.format(blah), file=f)
        print(SVG_TAG_OPEN, file=f)
        for svgelement in svgels:
            print(svgelement, file=f)
        print(SVG_TAG_CLOSE, file=f) 

    print(TEXT.format('More blah about code'), file=f)

    print(CODE.format('>>> import this'), file=f)

print('Done')
Notes:

1) I had good intentions of including a code box (the CODE string constant in the Python code), but I hit a bit of a wall. Not only is my code a mixed bag (this blog was always intended as a learning experience and a place for trying things out), it doesn't look good. We deal.

Which brings me to the point: you hear titles like front end developer, designer, website marketer etc. and think, "Well, it's kind of like art, kind of like coding . . . sort of creative." I now know, it's coding and it's thinking and it's grinding. All respect.
2) Steven Lott recently published a book that I bought (pdf). I have found it helpful. It's very pragmatic. I'm only about 15% of the way into it, but his treatment of regular expressions as just another tool not to be scared of made me less tentative in my REGEX use (even if my REGEXes are far from elegant). Group capture with parens was really helpful for this exercise.
3) Our chief geology database administrator commented that the colors of the DAG Hamilton logo are those of the Spanish shawl nudibranch. There is a resemblance.
Photo courtesy of iNaturalist.
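
As an illustration of the group capture with parens mentioned in note 2, here is a minimal sketch. The pattern and the sample segment are made up for this example (they are not the BEZPAT or the logo data used in the code above); the idea is simply that each parenthesized group pulls out one coordinate, which can then be converted, scaled, and written back.

```python
import re

# Hypothetical pattern: one relative cubic bezier segment,
# "c dx1 dy1 dx2 dy2 dx dy", with six captured numbers.
NUM = r'(-?\d+(?:\.\d+)?)'
BEZ = re.compile(r'c\s*' + r'[\s,]+'.join([NUM] * 6))

# Made-up sample segment.
segment = 'c10 -20.5 30 40 50 60'

match = BEZ.match(segment)
# Each pair of parens yields one string in groups().
coords = [float(x) for x in match.groups()]
# Scale every coordinate by the same factor.
scaled = [0.5 * x for x in coords]
# Write the segment back out as text.
rebuilt = 'c' + ' '.join('{0:.2f}'.format(x) for x in scaled)
print(rebuilt)
```

Matching from the start of the remaining string and advancing an index by match.span()[1], as the code above does, is one way to walk a long path string one segment at a time.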



Thanks for stopping by.


Embedding an SVG in a graphviz Generated SVG and More DAG Hamilton
2024-08-09T18:39:00.000-07:00
Last time I used a previous post's DAG Hamilton graphviz output to generate a series of functionally highlighted DAG Hamilton workflow graphs. The SVG (scalable vector graphics) versions of these graphs will serve as the input for this post.

I was dissatisfied with the quality of the PNG output, or at least how it rendered, fuzzy and illegible. My thought was that an SVG presentation would provide a more crisp, scalable (hence the name SVG) view of each graph.

Where I ran into problems was the embedded logos. Applications like PowerPoint accepted the logos as SVG "images" nested within the SVG "image", but did not render them; blank spaces remained.

So I set out to embed the SVG of the logos inline as elements within the final SVG file; it turned into quite the journey . . .

So SVG is really just XML, right? No, it is XML; it's just not just XML. There are XML tags and what is inside those tags can contain multiple SVG characteristics, all in their own syntax, most listed as quoted text.
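
A small sketch of what that looks like from Python, using a made-up SVG string (not one of the graphs from this post): xml.etree.ElementTree handles the XML layer, but the attribute values come back as plain strings, each in its own SVG syntax, that still have to be picked apart by hand.

```python
import xml.etree.ElementTree as ET

# Made-up, minimal SVG document for illustration only.
SVG = ('<svg xmlns="http://www.w3.org/2000/svg" width="100pt" height="50pt" '
       'viewBox="0 0 100 50">'
       '<g transform="scale(1 1) rotate(0) translate(4 46)">'
       '<path d="M0,0 C10,10 20,10 30,0" fill="#C3368C"/>'
       '</g></svg>')

root = ET.fromstring(SVG)
group = root[0]
path = group[0]

# The XML part: tags come back qualified with the SVG namespace.
print(root.tag)

# The SVG part: every attribute is quoted text with its own syntax.
viewbox = [float(x) for x in root.attrib['viewBox'].split()]
print(viewbox)
print(group.attrib['transform'])
print(path.attrib['d'])
```

The viewBox is four space separated numbers, transform is a list of function-like commands, and d is the path mini-language; none of these are parsed by the XML library.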

At this point, finding a library that allows for programmatic manipulation of SVG by tag, or reviewing some open source browser source code, might have helped. I did neither (a brief internet search yielded Python libraries, but they seemed focused more on conversion between SVG and other image formats) and set out on my own.

Like most people, I have played with Inkscape and converted images to SVG format. I even blogged about having done this with POVRay rendered pysanky eggs back in the day. Using something with software written by people way smarter than you and actually understanding it are two entirely different animals.

To make matters worse . . . I cannot actually display the SVG images or inline them here on Blogger. Smaller SVG snippets seem to work, but an entire graph with SVG logos is either too much or I am doing something wrong. Another (blurry) PNG example of the output will have to do.
Important concepts with links:

1) viewBox, scale, dimensions - Soueidan (classic, kind of the standard as far as I can tell):

https://www.sarasoueidan.com/blog/mimic-relative-positioning-in-svg/

2) the four quadrants of svg space (but you only see the lower right):

http://dh.obdurodon.org/coordinate-tutorial.xhtml

3) use x, y positioning to place embedded SVG rather than viewBox coordinates:

I have lost the link, but whoever suggested this, thank you.

4) (no link) Allow graphviz to do as much work as possible before editing any svg. For instance, when bolding edges of the graph in SVG, the edges will invariably overlap the nodes. This looks ugly. graphviz handles all that and it is far far simpler than trying to do it on your own.

5) bezier curves - nothing in this post about them, but they were part of my real introduction to SVG, and the most fun part. Recommend.

https://javascript.info/bezier-curve#de-casteljau-s-algorithm
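
For the bezier curves in item 5, a minimal sketch of de Casteljau's algorithm (the subject of the linked page). The function names and control points are made up for illustration: a point on the curve is found by repeatedly interpolating between neighboring control points until only one point is left.

```python
def lerp(p, q, t):
    """Linear interpolation between points p and q at parameter t."""
    return tuple((1 - t) * a + t * b for a, b in zip(p, q))

def de_casteljau(points, t):
    """Point on the bezier curve defined by control points, at t in [0, 1]."""
    pts = list(points)
    # Each pass replaces n points with n - 1 interpolated points.
    while len(pts) > 1:
        pts = [lerp(p, q, t) for p, q in zip(pts, pts[1:])]
    return pts[0]

# Made-up cubic bezier: start, two control points, end.
control = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]

start = de_casteljau(control, 0.0)
mid = de_casteljau(control, 0.5)
end = de_casteljau(control, 1.0)
print(start, mid, end)
```

The curve starts at the first control point and ends at the last; the two middle points only pull the curve toward them.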

Methodology for putting SVG logos inside the SVG document (not necessarily in order):

1) scale the embedded SVGs with the "width" and "height" attributes (SVG). I made mine proportional relative to the original SVGs' dimensions.

2) Calculate where the SVGs are supposed to go within the graphviz generated SVG coordinate space.

graphviz pushes everything into the upper right SVG space quadrant with an SVG "translate" command with 4 units padding. This needs to be taken into account when positioning the SVG elements relative to graphviz' coordinate space. The elements will be using the lower right SVG space quadrant coordinate space.

3) Leverage the positioning and size of the original PNG logos to place your SVG ones, then pop the old logo image elements and "erase" the boxes around them (yes, quite hacky, but effective).
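
The proportional scaling in step 1 could be sketched as below. This is an assumption about the arithmetic, not a copy of the helper functions in reusedfunctions.py; fit_within and its parameters are hypothetical names. The idea is to pick one scale factor so the logo fits the space the PNG occupied without distorting it.

```python
def fit_within(orig_width, orig_height, box_width, box_height):
    """
    Scale (orig_width, orig_height) to fit inside the
    box while keeping the original aspect ratio.

    Returns two tuple of width, height.
    """
    # The tighter of the two constraints decides the scale.
    scale = min(box_width / orig_width, box_height / orig_height)
    return orig_width * scale, orig_height * scale

# Made-up numbers: a 400 x 200 logo going into a 100 x 100 slot.
fitted = fit_within(400.0, 200.0, 100.0, 100.0)
print(fitted)
```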

This is a Python blog. Nutshell: I used xml.etree.ElementTree and rudimentary text processing of the SVG specific parts to get this done.
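
A stripped down sketch of that approach, using two tiny made-up SVG documents in place of the real graph and logo files. It reads the graphviz style translate, adds it to the old image's x and y to get the position in the outer coordinate space, then nests the logo svg element with its own viewBox, width, height, x, and y. The numbers and variable names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

SVG_NS = 'http://www.w3.org/2000/svg'
# Keep the default namespace unprefixed when serializing.
ET.register_namespace('', SVG_NS)

# Made-up stand-ins for the graphviz graph and a logo file.
graph = ET.fromstring(
    '<svg xmlns="http://www.w3.org/2000/svg" width="200pt" height="100pt" '
    'viewBox="0 0 200 100">'
    '<g transform="scale(1 1) rotate(0) translate(4 96)">'
    '<image x="10" y="-80" width="40px" height="20px"/>'
    '</g></svg>')
logo = ET.fromstring(
    '<svg xmlns="http://www.w3.org/2000/svg" width="400" height="200">'
    '<rect width="400" height="200" fill="#C3368C"/></svg>')

# Text processing of the SVG specific part: pull out translate(x y).
transform = graph[0].attrib['transform']
inner = transform[transform.index('translate(') + len('translate('):]
inner = inner[:inner.index(')')]
tx, ty = (float(v) for v in inner.split())

# Old image position is relative to the translated group.
image = graph[0][0]
x = float(image.attrib['x']) + tx
y = float(image.attrib['y']) + ty

# viewBox covers the logo's original size; width/height shrink it.
logo.set('viewBox', '0 0 {0} {1}'.format(logo.attrib['width'],
                                         logo.attrib['height']))
logo.set('width', image.attrib['width'].replace('px', ''))
logo.set('height', image.attrib['height'].replace('px', ''))
logo.set('x', str(x))
logo.set('y', str(y))
graph.append(logo)

print(ET.tostring(graph, encoding='unicode'))
```

Appending the logo to the root element (outside the translated group) is why the translate has to be added in by hand.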

The whole thing got quite unwieldy and I turned once again to DAG Hamilton to help me organize and visualize things. (Blurry) screenshot below:




Wow, it looks like you just collected every piece of information you could about all the dimensions and smashed it all together in the final SVG document at the end.

Yes.

Hey, why is that one node just hanging out at a dead end not doing anything?

I was not getting the whole coordinate thing and needed it for reference.
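
For readers new to the noun-named function style, here is a toy illustration of the idea: a function's parameter names say which other functions (or inputs) it depends on. This is deliberately not Hamilton and not how Hamilton is implemented; resolve and the three small functions are hypothetical, written only to show how the wiring by name can work.

```python
import inspect

# Hypothetical noun-named functions; parameters name their dependencies.
def logo_width(logo_attrib: dict) -> float:
    return float(logo_attrib['width'])

def logo_height(logo_attrib: dict) -> float:
    return float(logo_attrib['height'])

def dimension_ratio(logo_width: float, logo_height: float) -> float:
    return max(logo_width, logo_height) / min(logo_width, logo_height)

def resolve(name, functions, inputs, cache=None):
    """Toy resolver: compute name by first computing its parameters."""
    cache = {} if cache is None else cache
    if name in inputs:
        return inputs[name]
    if name not in cache:
        fcn = functions[name]
        params = inspect.signature(fcn).parameters
        kwargs = {p: resolve(p, functions, inputs, cache) for p in params}
        cache[name] = fcn(**kwargs)
    return cache[name]

FUNCTIONS = {f.__name__: f for f in (logo_width, logo_height, dimension_ratio)}
ratio = resolve('dimension_ratio', FUNCTIONS,
                {'logo_attrib': {'width': '200', 'height': '50'}})
print(ratio)
```

The real library adds the graph visualization, type checking, and much more on top of this naming convention; the code below uses the actual Hamilton driver.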

The code:

# run.py - the DAG Hamilton control file.

"""
Hamilton wrapper.
"""

# https://hamilton.dagworks.io/en/latest/concepts/driver/#recap

import sys

import pprint

from hamilton import driver

import editsvgs as esvg

OUTPUTFILES = {'data_source_highlighted':'data_source_highlighted_final',
               'web_scraping_functions_highlighted':'web_scraping_functions_highlighted_final',
               'output_functions_highlighted':'output_functions_highlighted_final'}

dr = driver.Builder().with_modules(esvg).build()

dr.display_all_functions('esvg.svg', deduplicate_inputs=True, keep_dot=True, orient='BR')

for keyx in OUTPUTFILES:
    results = dr.execute(['hamilton_logo_root',
                          'company_logo_root',
                          'graph_root',
                          'doc_attrib',
                          'hamilton_logo_tree_indices',
                          'company_logo_tree_indices',
                          'hamilton_logo_png_attrib',
                          'company_logo_png_attrib',
                          'parsed_hamiltonlogo_png_attrib',
                          'parsed_companylogo_png_attrib',
                          'biggerdimension',
                          'dimensionratio',
                          'parsed_graph_dimensions',
                          'hamilton_logo_position',
                          'hamilton_svg_dimensions',
                          'hamilton_logo_dimensions_orig',
                          'biggerdimension_company',
                          'dimensionratio_company',
                          'company_logo_position',
                          'company_svg_dimensions',
                          'company_logo_dimensions_orig',
                          'final_svg_file'],
                          inputs={'hamiltonlogofile':'hamiltonlogolarge.svg',
                                  'companylogofile':'fauxcompanylogo.svg',
                                  'testfile':keyx + '.svg',
                                  'outputfile':OUTPUTFILES[keyx] + '.svg',
                                  'hamiltonlogopng':'hamiltonlogolarge.png',
                                  'companylogopng':'fauxcompanylogo.png'})
    
    print('\ndoc_attrib =\n')
    pprint.pprint(results['doc_attrib'])
    print('\nhamilton_logo_tree_indices =\n')
    pprint.pprint(results['hamilton_logo_tree_indices'])
    print('\ncompany_logo_tree_indices =\n')
    pprint.pprint(results['company_logo_tree_indices'])
    print('\nhamilton_logo_png_attrib =\n')
    pprint.pprint(results['hamilton_logo_png_attrib'])
    print('\nparsed_hamiltonlogo_png_attrib =\n')
    pprint.pprint(results['parsed_hamiltonlogo_png_attrib'])
    print('\ncompany_logo_png_attrib =\n')
    pprint.pprint(results['company_logo_png_attrib'])
    print('\nparsed_companylogo_png_attrib =\n')
    pprint.pprint(results['parsed_companylogo_png_attrib'])
    print('\nbiggerdimension =\n')
    pprint.pprint(results['biggerdimension'])
    print('\ndimensionratio =\n')
    pprint.pprint(results['dimensionratio'])
    print('\nparsed_graph_dimensions =\n')
    pprint.pprint(results['parsed_graph_dimensions'])
    print('\nhamilton_logo_position =\n')
    pprint.pprint(results['hamilton_logo_position'])
    print('\nhamilton_svg_dimensions =\n')
    pprint.pprint(results['hamilton_svg_dimensions'])
    print('\nhamilton_logo_dimensions_orig =\n')
    pprint.pprint(results['hamilton_logo_dimensions_orig'])
    print('\ndimensionratio_company =\n')
    pprint.pprint(results['dimensionratio_company'])
    print('\ncompany_logo_position =\n')
    pprint.pprint(results['company_logo_position'])
    print('\ncompany_svg_dimensions =\n')
    pprint.pprint(results['company_svg_dimensions'])
    print('\ncompany_logo_dimensions_orig =\n')
    pprint.pprint(results['company_logo_dimensions_orig'])
    print('\nfinal_svg_file =\n')
    pprint.pprint(results['final_svg_file'])

# editsvgs.py - DAG Hamilton noun-named functions.

# python 3.12

"""
Attempt to position svg logos and edit
flowchart with svg.
"""

import os

import pprint

import xml.etree.ElementTree as ET

import itertools

import sys

import copy

import reusedfunctions as rf

# Pop this to get rid of png image.
# '{http://www.w3.org/1999/xlink}href': 'hamiltonlogolarge.png'}
PNG_ATTRIB_KEY = '{http://www.w3.org/1999/xlink}href'
# '{http://www.w3.org/1999/xlink}href': 'fauxcompanylogo.png'}

def hamilton_logo_root(hamiltonlogofile:str) -> ET.Element:
    """
    Get root of ElementTree object for Hamilton
    logo svg file.

    hamiltonlogofile is the svg file with the Hamilton logo.
    """
    print('Getting Hamilton logo svg file root Element . . .')
    return rf.getroot(hamiltonlogofile)

def company_logo_root(companylogofile:str) -> ET.Element:
    """
    Get root of ElementTree object for company
    logo svg file.

    companylogofile is the svg file with the company logo.
    """
    print('Getting company logo svg file root Element . . .')
    return rf.getroot(companylogofile)

def graph_root(testfile:str) -> ET.Element:
    """
    Gets root Element of graphviz graph svg.

    testfile is the graphviz svg file.
    """
    print('Getting root Element of main graph svg file . . .')
    return rf.getroot(testfile)

def doc_attrib(graph_root:ET.Element) -> dict:
    """
    Gets graphviz svg document's dimensions and viewBox
    in a dictionary.

    graph_root is the graphviz svg file root Element. 

    Returns dictionary of xml/svg data for doc.
    """
    print('Getting dimensions and viewBox for main graph svg file . . .')
    return graph_root.attrib

def hamilton_logo_tree_indices(graph_root:ET.Element, hamiltonlogopng:str) -> tuple:
    """
    Get tree indices (3 deep) for original png
    Hamilton logo on graph.

    graph_root is the root Element of graphviz graph svg.

    hamiltonlogopng is the name of the png file referenced
    in the image link in the svg file (string).

    Returns 3 tuple of integers.
    """
    print('Getting ElementTree indices for tree for Hamilton png logo Element . . .')
    return rf.gettreeindices(graph_root, hamiltonlogopng)

def company_logo_tree_indices(graph_root:ET.Element, companylogopng:str) -> tuple:
    """
    Get tree indices (3 deep) for original png
    company logo on graph.

    graph_root is the root Element of graphviz graph svg.

    companylogopng is the name of the png file referenced
    in the image link in the svg file (string).

    Returns 3 tuple of integers.
    """
    print('Getting ElementTree indices for tree for company png logo Element . . .')
    return rf.gettreeindices(graph_root, companylogopng)

def hamilton_logo_png_attrib(graph_root:ET.Element, hamilton_logo_tree_indices:tuple) -> dict:
    """
    Get attrib dictionary for original Hamilton png file Element in graph svg.

    graph_root is the root Element of graphviz graph svg.

    hamilton_logo_tree_indices are the lookup indices for the Hamilton
    logo png Element within the xml tree.
    """
    print('Getting attrib dictionary for original Hamilton png file Element in graph svg . . .')
    return rf.getpngattrib(graph_root, hamilton_logo_tree_indices)

def company_logo_png_attrib(graph_root:ET.Element, company_logo_tree_indices:tuple) -> dict:
    """
    Get attrib dictionary for original company png file Element in graph svg.

    graph_root is the root Element of graphviz graph svg.

    company_logo_tree_indices are the lookup indices for the company
    logo png Element within the xml tree.
    """
    print('Getting attrib dictionary for original company png file Element in graph svg . . .')
    return rf.getpngattrib(graph_root, company_logo_tree_indices)

def parsed_hamiltonlogo_png_attrib(hamilton_logo_png_attrib:dict) -> dict:
    """
    Work dictionary that has information on former
    location of png Hamilton logo image in the 
    graphviz svg.

    Basically getting svg text values into float format.

    Returns new dictionary.
    """
    print('Getting svg text values into float format for Hamilton png Element . . .')
    return rf.parsepngattrib(hamilton_logo_png_attrib)

def parsed_companylogo_png_attrib(company_logo_png_attrib:dict) -> dict:
    """
    Work dictionary that has information on former
    location of png company logo image in the 
    graphviz svg.

    Basically getting svg text values into float format.

    Returns new dictionary.
    """
    print('Getting svg text values into float format for company logo png Element . . .')
    return rf.parsepngattrib(company_logo_png_attrib)

def biggerdimension(hamilton_logo_root:ET.Element) -> str:
    """
    hamilton_logo_root is the ElementTree Element
    for the big svg Hamilton logo.

    Returns 'Y' if the y dimension is the 
    bigger one, and 'X' if the x one is.

    Returns None if there is a key error.
    """
    print('Determining bigger dimension for svg Hamilton logo . . .')
    return rf.getbiggerdimension(hamilton_logo_root)

def dimensionratio(biggerdimension:str, hamilton_logo_root:ET.Element) -> float:
    """
    biggerdimension is a string, 'X' or 'Y'.

    hamilton_logo_root is the ElementTree Element
    for the big svg Hamilton logo.

    Returns ratio of bigger dimension
    to smaller one (float).
    """
    print('Calculating dimensions ratio for Hamilton logo svg . . .')
    return rf.getdimensionratio(biggerdimension, hamilton_logo_root)

def parsed_graph_dimensions(graph_root:ET.Element) -> tuple:
    """
    Get translate coordinates from graphviz
    svg root Element.

    Returns two tuple of x, y translation.
    """
    graph0dimensions = graph_root[0].attrib
    coordstr = graph0dimensions['transform']
    coordstr = coordstr[coordstr.index('translate'):]
    coordstr = coordstr[coordstr.index('(') + 1:-1]
    vals = [float(x) for x in coordstr.split(' ')]
    return tuple(vals)

def hamilton_logo_position(parsed_graph_dimensions:tuple,
                           parsed_hamiltonlogo_png_attrib:dict) -> tuple:
    """
    parsed_graph_dimensions is an x, y two tuple.

    parsed_hamiltonlogo_png_attrib is a dictionary.

    Returns x, y position of Hamilton logo svg graphic as a
    two tuple.
    """
    print('Getting position of Hamilton logo . . .')
    return rf.getposition(parsed_graph_dimensions, parsed_hamiltonlogo_png_attrib)

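# Note: company_logo_position is defined again further down in this
# module; Python keeps the later definition, which delegates to
# rf.getposition.
def company_logo_position(parsed_graph_dimensions:tuple,
                          parsed_companylogo_png_attrib:dict) -> tuple:
    """
    parsed_graph_dimensions is an x, y two tuple.

    parsed_companylogo_png_attrib is a dictionary.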

    Returns x, y position of company logo svg graphic as a
    two tuple.
    """
    print('Getting position of company logo . . .')
    # Add 4.
    x = parsed_companylogo_png_attrib['X'] + parsed_graph_dimensions[0]
    # Add negative number with big absolute value.
    # Upper right quadrant translation thing.
    y = parsed_companylogo_png_attrib['Y'] + parsed_graph_dimensions[1] 
    return x, y

def hamilton_svg_dimensions(parsed_hamiltonlogo_png_attrib:dict,
                            biggerdimension:str,
                            dimensionratio:float) -> tuple:
    """
    Get width and height of svg Hamilton logo within
    final document.

    parsed_hamiltonlogo_png_attrib is the dictionary of
    numeric values associated with the original image
    position of the Hamilton png logo within the
    svg document.

    biggerdimension is the 'X' or 'Y' value that
    indicates which dimension is the larger of
    the two.

    dimensionratio is the ratio of the larger
    dimension to the smaller one.

    Returns x, y two tuple of floats.
    """
    print('Getting size of Hamilton logo in final doc . . .')
    return rf.getdimensions(parsed_hamiltonlogo_png_attrib, biggerdimension, dimensionratio)

def hamilton_logo_dimensions_orig(hamilton_logo_root:ET.Element) -> tuple:
    """
    hamilton_logo_root is the ElementTree Element
    for the big svg Hamilton logo.

    Returns two tuple of width, height.
    """
    print('Retrieving dimensions of Hamilton logo svg . . .')
    return rf.getdimensionsorig(hamilton_logo_root)

def final_svg_file(testfile:str,
                   outputfile:str,
                   hamilton_logo_root:ET.Element,
                   hamilton_logo_tree_indices:tuple,
                   hamilton_logo_position:tuple,
                   hamilton_svg_dimensions:tuple,
                   hamilton_logo_dimensions_orig:tuple,
                   company_logo_tree_indices:tuple,
                   parsed_companylogo_png_attrib:dict,
                   company_logo_position:tuple,
                   company_logo_root:ET.Element,
                   company_svg_dimensions:tuple,
                   company_logo_dimensions_orig:tuple,
                   ) -> str:
    """
    Replaces image logos with scalable
    svg ones.

    testfile is the name of the original svg file.

    outputfile is the name of the intended final
    svg file.

    hamilton_logo_root is the ElementTree root object
    for the Hamilton logo svg file.

    hamilton_logo_tree_indices are nested indices
    indicating the location of the original Hamilton
    logo png ElementTree Element within the input
    svg document.

    hamilton_logo_position - x, y tuple - where to
    put the svg Hamilton logo within the final
    svg document.

    hamilton_svg_dimensions - x, y tuple - width and
    height of Hamilton svg logo within the final
    svg document.

    hamilton_logo_dimensions_orig - two tuple of width, height
    of original svg file Hamilton logo.

    company_logo_tree_indices are nested indices
    indicating the location of the original company
    logo png ElementTree Element within the input
    svg document.

    company_logo_position - x, y tuple - where to
    put the svg company logo within the final
    svg document.

    company_logo_root is the ElementTree root object
    for the company logo svg file.
 
    company_svg_dimensions - x, y tuple - width and
    height of company svg logo within the final
    svg document.

    company_logo_dimensions_orig - two tuple of width, height
    of original svg file company logo.

    Returns string filename.
    """
    print('Making changes to svg . . .')
    hlti = hamilton_logo_tree_indices 
    retval = outputfile
    tree = ET.parse(testfile)
    root = tree.getroot()
    # pop Hamilton png
    print('Popping original Hamilton png logo . . .')
    root[hlti[0]][hlti[1]][hlti[2]].attrib.pop(PNG_ATTRIB_KEY)
    print('Appending Hamilton svg to root Element . . .')
    root.append(hamilton_logo_root)
    print('Adjusting viewBox for Hamilton svg . . .')
    root[-1].attrib['viewBox'] = '0.00 0.00 {0:.3f} {1:.3f}'.format(*hamilton_logo_dimensions_orig)
    print('Adjusting height and width for Hamilton svg . . .')
    root[-1].attrib['height'] = str(hamilton_svg_dimensions[1])
    root[-1].attrib['width'] = str(hamilton_svg_dimensions[0])
    print('Positioning Hamilton logo svg within final svg . . .')
    root[-1].attrib['x'] = str(hamilton_logo_position[0])
    root[-1].attrib['y'] = str(hamilton_logo_position[1])
    print('Erasing Hamilton logo bounding box . . .')
    # After popping png, polygon resides one index unit back.
    root[hlti[0]][hlti[1]][hlti[2] - 1].attrib['stroke'] = 'none'
    clti = company_logo_tree_indices
    # pop company png
    print('Popping original company png logo . . .')
    root[clti[0]][clti[1]][clti[2]].attrib.pop(PNG_ATTRIB_KEY)
    print('Adding company logo svg Element to main svg file . . .')
    root.append(company_logo_root)
    print('Adjusting viewBox for company svg . . .')
    root[-1].attrib['viewBox'] = '0.00 0.00 {0:.3f} {1:.3f}'.format(*company_logo_dimensions_orig)
    print('Adjusting height and width for company svg . . .')
    root[-1].attrib['height'] = str(company_svg_dimensions[1])
    root[-1].attrib['width'] = str(company_svg_dimensions[0])
    print('Moving company logo svg to the correct position in the display . . .')
    # Had to adjust 15 units to get it out of the way of the legend.
    root[-1].attrib['x'] = str(company_logo_position[0] - 15)
    root[-1].attrib['y'] = str(company_logo_position[1])
    print('Erasing company logo bounding box . . .')
    # After popping png, polygon resides one index unit back.
    root[clti[0]][clti[1]][clti[2] - 1].attrib['stroke'] = 'none'
    print('Writing new svg . . .')
    tree.write(retval)
    return retval

def biggerdimension_company(company_logo_root:ET.Element) -> str:
    """
    company_logo_root is the ElementTree Element
    for the big svg company logo.

    Returns 'Y' if the y dimension is the 
    bigger one, and 'X' if the x one is.
    """
    print('Determining bigger dimension for svg company logo . . .')
    return rf.getbiggerdimension(company_logo_root)

def dimensionratio_company(biggerdimension_company:str,
                           company_logo_root:ET.Element) -> float:
    """
    biggerdimension_company is a string, 'X' or 'Y'.

    company_logo_root is the ElementTree Element
    for the big svg company logo.

    Returns ratio of bigger dimension
    to smaller one (float).
    """
    print('Calculating dimensions ratio for company logo svg . . .')
    return rf.getdimensionratio(biggerdimension_company,
                                company_logo_root)

def company_logo_position(parsed_graph_dimensions:tuple,
                          parsed_companylogo_png_attrib:dict) -> tuple:
    """
    parsed_graph_dimensions is an x, y two tuple.

    parsed_companylogo_png_attrib is a dictionary.

    Returns x, y position of company logo svg graphic as a
    two tuple.
    """
    print('Getting position of company logo . . .')
    return rf.getposition(parsed_graph_dimensions,
                          parsed_companylogo_png_attrib)

def company_svg_dimensions(parsed_companylogo_png_attrib:dict,
                           biggerdimension_company:str,
                           dimensionratio_company:float) -> tuple:
    """
    Get width and height of svg company logo within
    final document.

    parsed_companylogo_png_attrib is the dictionary of
    numeric values associated with the original image
    position of the company png logo within the
    svg document.

    biggerdimension_company is the 'X' or 'Y' value that
    indicates which dimension is the larger of
    the two.

    dimensionratio_company is the ratio of the larger
    dimension to the smaller one.

    Returns x, y two tuple of floats.
    """
    pprint.pprint(parsed_companylogo_png_attrib)
    print('Getting size of company logo in final doc . . .')
    return rf.getdimensions(parsed_companylogo_png_attrib, 
                            biggerdimension_company,
                            dimensionratio_company)

def company_logo_dimensions_orig(company_logo_root:ET.Element) -> tuple:
    """
    company_logo_root is the ElementTree Element
    for the big svg company logo.

    Returns two tuple of width, height.
    """
    print('Retrieving dimensions of company logo svg . . .')
    return rf.getdimensionsorig(company_logo_root)

# reusedfunctions.py - utility/helper/main functionality
#                      at a granular level.

# python 3.12

"""
Auxiliary module to Hamilton svg script.
"""

import itertools

import xml.etree.ElementTree as ET

# Pop this to get rid of png image.
# '{http://www.w3.org/1999/xlink}href': 'hamiltonlogolarge.png'}
PNG_ATTRIB_KEY = '{http://www.w3.org/1999/xlink}href'

def gettreeindices(graph_root, png):
    """
    Get tree indices (3 deep) for png on graph.

    graph_root is the root Element of graphviz graph svg.

    png is the name of the png file referenced
    in the image link in the svg file (string).

    Returns 3 tuple of integers.
    """
    # enumerate supplies the index at each of the three nesting levels.
    for counterx, nodex in enumerate(graph_root):
        for countery, nodey in enumerate(nodex):
            for counterz, nodez in enumerate(nodey):
                if nodez.attrib.get(PNG_ATTRIB_KEY) == png:
                    return counterx, countery, counterz
    # No matching image Element found.
    return None

def getroot(filename):
    """
    Get root of ElementTree object for svg file.

    filename is the svg file string.
    """
    return ET.parse(filename).getroot()

def getpngattrib(graph_root, indices):
    """
    Get attrib dictionary for png file Element in graph svg.

    graph_root is the root Element of graphviz graph svg.

    indices are the lookup indices for the 
    png Element within the xml tree.
    """
    return graph_root[indices[0]][indices[1]][indices[2]].attrib

def parsepngattrib(attrib):
    """
    Work dictionary that has information on
    location of png image in the 
    graphviz svg.

    Basically getting svg text values into float format.

    Returns new dictionary.
    """
    retval = {}
    retval['X'] = float(attrib['x'])
    retval['Y'] = float(attrib['y'])
    retval['height'] = float(attrib['height'][:attrib['height'].index('px')])
    retval['width'] = float(attrib['width'][:attrib['width'].index('px')])
    return retval

def getbiggerdimension(root):
    """
    root is the ElementTree Element
    for the svg file element to be
    embedded into the main svg file.

    Returns 'Y' if the y dimension is the 
    bigger one, and 'X' if the x one is.

    Returns None if a dimension is missing
    or is not a plain number.
    """
    dimensions = root.attrib
    try:
        if float(dimensions['height']) > float(dimensions['width']):
            return 'Y'
        else:
            # X bigger or equal
            return 'X'
    except (KeyError, ValueError):
        pass
    return None

def getdimensionratio(biggerdimension, root):
    """
    biggerdimension is a string, 'X' or 'Y'.

    root is the etree Element for the
    svg Element that is to be embedded
    into the final svg file

    Returns ratio of bigger dimension
    to smaller one (float).
    """
    dimensions = root.attrib
    if biggerdimension == 'Y':
        return float(dimensions['height']) / float(dimensions['width'])
    else:
        return float(dimensions['width']) / float(dimensions['height'])

def getposition(dimensions, attrib):
    """
    dimensions is an x, y two tuple.

    attrib is a dictionary.

    Returns x, y position of svg graphic as a
    two tuple.
    """
    # Add 4.
    x = attrib['X'] + dimensions[0]
    # Add negative number with big absolute value.
    # Upper right quadrant translation thing.
    y = attrib['Y'] + dimensions[1] 
    return x, y

def getdimensions(attrib, biggerdimension, dimensionratio):
    """
    Get width and height of svg within
    final document.

    attrib is the dictionary of numeric values
    associated with the original image
    position of the png within the
    svg document.

    biggerdimension is the 'X' or 'Y' value that
    indicates which dimension is the larger of
    the two.

    dimensionratio is the ratio of the larger
    dimension to the smaller one.

    Returns x, y two tuple of floats.
    """
    if biggerdimension == 'Y':
        return (attrib['width'],  dimensionratio * attrib['width'])
    else:
        return (dimensionratio * attrib['height'], attrib['height'])

def getdimensionsorig(root):
    """
    root is the ElementTree Element
    for the svg Element to be embedded 
    in the main svg file.

    Returns two tuple of width, height.
    """
    return float(root.attrib['width']), float(root.attrib['height'])

# OUTPUT (stdout)

Getting Hamilton logo svg file root Element . . .
Getting company logo svg file root Element . . .
Getting root Element of main graph svg file . . .
Getting dimensions and viewBox for main graph svg file . . .
Getting ElementTree indices for tree for Hamilton png logo Element . . .
Getting ElementTree indices for tree for png Element . . .
Getting ElementTree indices for tree for company png logo Element . . .
Getting ElementTree indices for tree for png Element . . .
Getting attrib dictionary for original Hamilton png file Element in graph svg . . .
Getting attrib dictionary for original company png file Element in graph svg . . .
Getting svg text values into float format for Hamilton png Element . . .
Getting svg text values into float format for company logo png Element . . .
Determining bigger dimension for svg Hamilton logo . . .
Calculating dimensions ratio for Hamilton logo svg . . .
Getting position of Hamilton logo . . .
Getting size of Hamilton logo in final doc . . .
Retrieving dimensions of Hamilton logo svg . . .
Determining bigger dimension for svg company logo . . .
Calculating dimensions ratio for company logo svg . . .
Getting position of company logo . . .
Getting size of company logo in final doc . . .
Retrieving dimensions of company logo svg . . .
Making changes to svg . . .
Popping original Hamilton png logo . . .
Appending Hamilton svg to root Element . . .
Adjusting viewBox for Hamilton svg . . .
Adjusting height and width for Hamilton svg . . .
Positioning Hamilton logo svg within final svg . . .
Erasing Hamilton logo bounding box . . .
Popping original company png logo . . .
Adding company logo svg Element to main svg file . . .
Adjusting viewBox for company svg . . .
Adjusting height and width for company svg . . .
Moving company logo svg to the correct position in the display . . .
Erasing company logo bounding box . . .
Writing new svg . . .

doc_attrib =

{'height': '825pt', 'viewBox': '0.00 0.00 936.00 824.60', 'width': '936pt'}

hamilton_logo_tree_indices =

(0, 4, 2)

company_logo_tree_indices =

(0, 5, 2)

hamilton_logo_png_attrib =

{'height': '43.2px',
 'preserveAspectRatio': 'xMinYMin meet',
 'width': '43.2px',
 'x': '218.3',
 'y': '-673.9',
 '{http://www.w3.org/1999/xlink}href': 'hamiltonlogolarge.png'}

parsed_hamiltonlogo_png_attrib =

{'X': 218.3, 'Y': -673.9, 'height': 43.2, 'width': 43.2}

company_logo_png_attrib =

{'height': '43.2px',
 'preserveAspectRatio': 'xMinYMin meet',
 'width': '367.2px',
 'x': '279.3',
 'y': '-673.9',
 '{http://www.w3.org/1999/xlink}href': 'fauxcompanylogo.png'}

parsed_companylogo_png_attrib =

{'X': 279.3, 'Y': -673.9, 'height': 43.2, 'width': 367.2}

biggerdimension =

'X'

dimensionratio =

1.0421153385977506

parsed_graph_dimensions =

(4.0, 820.6)

hamilton_logo_position =

(222.3, 146.70000000000005)

hamilton_svg_dimensions =

(45.01938262742283, 43.2)

hamilton_logo_dimensions_orig =

(8710.0, 8358.0)

dimensionratio_company =

9.047619047619047

company_logo_position =

(283.3, 146.70000000000005)

company_svg_dimensions =

(390.8571428571429, 43.2)

company_logo_dimensions_orig =

(712.5, 78.75)

final_svg_file =

'data_source_highlighted_final.svg'

# . . . etc. 2 more times.
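
As a sanity check on the numbers printed above, the sizing arithmetic in getdimensions can be reproduced by hand. This is only a minimal sketch: the function name fit_to_slot is made up for illustration and is not part of the script, and the input values are copied from the output above.

```python
def fit_to_slot(orig_width, orig_height, slot_width, slot_height):
    """
    Scale an svg of orig_width x orig_height to sit where
    the png was, keeping the svg's aspect ratio.

    Same idea as getdimensions: keep the slot size along
    the logo's smaller dimension and stretch the other one
    by the ratio of bigger to smaller original dimension.
    """
    if orig_height > orig_width:
        ratio = orig_height / orig_width
        return slot_width, ratio * slot_width
    ratio = orig_width / orig_height
    return ratio * slot_height, slot_height

# Original svg sizes and png slot sizes from the output above.
hamilton = fit_to_slot(8710.0, 8358.0, 43.2, 43.2)
company = fit_to_slot(712.5, 78.75, 367.2, 43.2)
print(hamilton)
print(company)
```

Both results should agree with hamilton_svg_dimensions and company_svg_dimensions in the output, give or take floating point noise.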

Note on DAG Hamilton: my use case for this tool is very rudimentary and somewhat pedestrian. That said, it is becoming essential to my workflows.

The DAG Hamilton project is still in its relatively early stages with some very exciting active development ongoing. It seems like every week some amazing new decorator feature gets released.

I am not much of a decorator user myself, though I am grateful for their existence and for their use in the 3rd party modules I rely on. Truthfully, 3/4 of the work I do could probably be accomplished with a relatively recent version of Python and dictionaries.

Where DAG Hamilton helps me out a lot is in corralling and organizing code. I tend to get a bit undisciplined and have trouble "seeing" the execution path. DAG Hamilton helps there.

Thanks for stopping by.



Graphviz - Editing a DAG Hamilton Graph dot File
2024-07-05T18:10:00.000-07:00
Last post featured the DAG Hamilton generated graphviz graph shown below. I'll be dressing this up a little and highlighting some functionality. For the toy example here, the script employed is a bit of overkill. For a bigger workflow, it may come in handy.





I'll start with the finished products:

1) A Hamilton logo and a would-be company logo get added (manually; the Data Inputs Highlighted subtitle is there for later processing when we highlight functionality).

2) through 4) are done programmatically (code is shown further down). I saw an example on the Hamilton web pages that used aquamarine as the highlight color; I liked that, so I stuck with it.

2) Data source and data source function highlighted.



3) Web scraping functions highlighted.



4) Output nodes highlighted.



A few observations and notes before we look at configuration and code: I've found the charts to be really helpful in presenting my workflow to users and leadership (full disclosure: my boss liked some initial charts I made; my dream of the PowerPoint to solve all scripter<->customer communication challenges is not yet reality, but for the first time in a long time, I have hope.) 

In the web scraping highlighted diagram, you can pretty clearly see that data_with_company node has an input into the commodity_word_counts node. The domain specific rationale from the last blog post is that I don't want to count every "Barrick Gold" company name occurrence as another mention of "Gold" or "gold."
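
The counting rule can be sketched on its own. This is a simplified stand-in for the commodity_word_counts function from the last post, not a copy of it: the function name commodity_count and the sample text are made up for illustration, and it assumes each company name mention accounts for exactly one commodity hit.

```python
import re

def commodity_count(text, commodity, company):
    """
    Count mentions of commodity in text, leaving out the
    ones that are part of the company name.
    """
    # e.g. r'\b[Gg]old\b' for 'gold'.
    pat = r'\b[{0:s}{1:s}]{2:s}\b'.format(commodity[0].upper(),
                                          commodity[0],
                                          commodity[1:])
    raw = len(re.findall(pat, text))
    if re.search(pat, company):
        # Assumption: one commodity hit per company mention.
        return raw - len(re.findall(re.escape(company), text))
    return raw

# Made-up text, not from Wikipedia.
text = ('Barrick Gold operates the mine. The gold is refined on site. '
        'Gold output rose after Barrick Gold expanded the pit.')
print(commodity_count(text, 'gold', 'Barrick Gold'))
print(commodity_count(text, 'gold', 'Teck'))
```

The first call drops the two hits that sit inside "Barrick Gold"; the second keeps every hit because the company name does not contain the commodity.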

Toy example notwithstanding, in real life, being able to show where something branches critically is a real help. A gap between what people assume a script is doing and what it is actually doing can be costly in terms of time and productivity for all parties. It helps to be able to say and show ideas like, "What it's doing over here doesn't carry over to that other mission critical part you're really concerned with; it's only for purposes of the visualization which lies over here on the diagram" or "This node up here representing <the real life thing> is your sole source of input for this script; it is not looking at <other real world thing> at all."

graphviz and diagrams like this have been around for decades - UML, database schema visualizations, etc. What makes this whole DAG Hamilton thing better for me is how easy and accessible it is. I've seen C++ UML diagrams over the years (all respect to the C++ people - it takes a lot of ability, discipline, and effort); my first thought is often, "Oh wow . . . I'm not sure I have what it takes to do that . . . and I'm not sure I'd want to . . ."

Enough rationalization and qualifying - on to the config and the code!

I added the title and logos manually. The assumption that the graphviz dot file output of DAG Hamilton will always be in the format shown would be premature and probably wrong. It's an implementation detail subject to change and not a feature. That said, I needed some features in my graph outputs and I achieved them this one time.

Towards the top of the dot file is where the title goes:

// Dependency Graph
digraph {
        labelloc="t"
        label=<<b>Toy Web Scraping Script Run Diagram<BR/>Data Inputs Highlighted</b>> fontsize="36" fontname=Helvetica

labelloc="t" puts the text at the top of the graph (t for top; "b" would put it at the bottom).
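
I typed the title lines in by hand, but the same edit could be scripted. Below is a minimal sketch in the same brute force text processing spirit; the function name add_title is hypothetical, and it assumes the dot source contains a plain "digraph {" opening line and that the title text has no characters that need escaping in an html-like label.

```python
import re

def add_title(dot_src, title, subtitle):
    """
    Insert labelloc and an html-like label right after
    the opening 'digraph {' of a dot source string.
    """
    header = ('\n        labelloc="t"'
              '\n        label=<<b>{0:s}<BR/>{1:s}</b>>'
              ' fontsize="36" fontname=Helvetica').format(title, subtitle)
    # count=1: only touch the first 'digraph {'.
    return re.sub(r'digraph \{', lambda m: m.group(0) + header,
                  dot_src, count=1)

sample = ('// Dependency Graph\n'
          'digraph {\n'
          '\tparsed_data -> data_with_wikipedia\n'
          '}\n')
titled = add_title(sample,
                   'Toy Web Scraping Script Run Diagram',
                   'Data Inputs Highlighted')
print(titled)
```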

// Dependency Graph
digraph {
        labelloc="t"
        label=<<b>Toy Web Scraping Script Run Diagram<BR/>Data Inputs Highlighted</b>> fontsize="36" fontname=Helvetica
        hamiltonlogo [label="" image="hamiltonlogolarge.png" shape="box", width=0.6, height=0.6, fixedsize=true]
        companylogo [label="" image="fauxcompanylogo.png" shape="box", width=5.10 height=0.6 fixedsize=true]

The DAG Hamilton logo listed first appears to end up in the upper left part of the diagram most of the time (this is an empirical observation on my part; I don't have a super great handle on the internals of graphviz yet).

Getting the company logo next to it requires a bit more effort. A StackOverflow exchange had a suggestion of connecting it invisibly to an initial node. In this case, that would be the data source. Inputs in DAG Hamilton don't get listed in the graphviz dot file by their names, but rather by the node or nodes they are connected to: _parsed_data_inputs instead of "datafile" like you might expect. I have a preference for listing my input nodes only once (deduplicate_inputs=True is the keyword argument to DAG Hamilton's driver object's display_all_functions method that makes the graph).

The change is about one third of the way down the dot file where the node connection edges start getting listed:

	parsed_data -> data_with_wikipedia
	_parsed_data_inputs [label=<<table border="0"><tr><td>datafile</td><td>str</td></tr></table>> fontname=Helvetica margin=0.15 shape=rectangle style="filled,dashed" fillcolor="#ffffff"]
        companylogo -> _parsed_data_inputs [style=invis]

DAG Hamilton has a dashed box for script inputs. That's why there is all that extra description inside the square brackets for that node. I manually added the fillcolor="#ffffff" at the end. It's not necessary for the chart (I believe the default fill of white /#ffffff was specified near the top of the file), but it is necessary for the code I wrote to replace the existing color with something else. Otherwise, it does not affect the output.
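
Why the explicit fillcolor matters can be shown with a small sketch. This is not the code from the script (that one uses index arithmetic); it is a regular expression version of the same "find the node, then the next fillcolor" idea, with a made-up function name recolor_node and made-up two line dot fragments.

```python
import re

AQUAMARINE = '7fffd4'

def recolor_node(dot_src, node):
    """
    Swap the six hex digits of the first fillcolor that
    follows the first whole-word occurrence of node.
    """
    pat = re.compile(r'(\b' + re.escape(node)
                     + r'\b.*?fillcolor="#)[0-9a-fA-F]{6}',
                     re.DOTALL)
    return pat.sub(lambda m: m.group(1) + AQUAMARINE, dot_src, count=1)

with_fill = ('\t_parsed_data_inputs [shape=rectangle fillcolor="#ffffff"]\n'
             '\tparsed_data [fillcolor="#b4d8e4"]\n')
without_fill = ('\t_parsed_data_inputs [shape=rectangle]\n'
                '\tparsed_data [fillcolor="#b4d8e4"]\n')

print(recolor_node(with_fill, '_parsed_data_inputs'))
# Without its own fillcolor, the search runs on into the next node.
print(recolor_node(without_fill, '_parsed_data_inputs'))
```

In the second case the wrong node gets colored, which is the failure the manually added fillcolor="#ffffff" guards against.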

I think that's it for manual prep.

On to the code. Both DAG Hamilton and graphviz have APIs for customizing the graphviz dot file output. I've opted to approach this with brute force text processing. For my needs, this is the best option. YMMV. In general, text processing any code or configuration tends to be brittle. It worked this time.

# python 3.12

"""
Try to edit properties of graphviz output.
"""

import sys

import re

import itertools

import graphviz

INPUT = 'ts_with_logos_and_colors'

FILLCOLORSTRLEN = 12
AQUAMARINE = '7fffd4'
COLORLEN = len(AQUAMARINE)

BOLDED = ' penwidth=5'
BOLDEDEDGE = ' [penwidth=5]'

NODESTOCOLOR = {'data_source':['_parsed_data_inputs',
                               'parsed_data'],
                'webscraping':['data_with_wikipedia',
                               'colloquial_company_word_counts',
                               'data_with_company',
                               'commodity_word_counts'],
                'output':['info_output',
                          'info_dict_merged',
                          'wikipedia_report']}

EDGEPAT = r'\b{0:s}\b[ ][-][>][ ]\b{1:s}\b'

TITLEPAT = r'Toy Web Scraping Script Run Diagram[<]BR[/][>]'
ENDTITLEPAT = r'</b>>'

# Two tuples as values for edges.
EDGENODESTOBOLD = {'data_source':[('_parsed_data_inputs', 'parsed_data')],
                   'webscraping':[('data_with_wikipedia', 'colloquial_company_word_counts'),
                                  ('data_with_wikipedia', 'data_with_company'),
                                  ('data_with_wikipedia', 'commodity_word_counts'),
                                  ('data_with_company', 'commodity_word_counts')],
                   'output':[('data_with_company', 'info_output'),
                             ('colloquial_company_word_counts', 'info_dict_merged'),
                             ('commodity_word_counts', 'info_dict_merged'),
                             ('info_dict_merged', 'wikipedia_report'),
                             ('data_with_company', 'info_dict_merged')]}

OUTPUTFILES = {'data_source':'data_source_highlighted',
               'webscraping':'web_scraping_functions_highlighted',
               'output':'output_functions_highlighted'}

TITLES = {'data_source':'Data Sources and Data Source Functions Highlighted',
          'webscraping':'Web Scraping Functions Highlighted',
          'output':'Output Functions Highlighted'}

def get_new_source_nodecolor(src, nodex):
    """
    Return new source string for graphviz
    with selected node colored aquamarine.

    src is the original graphviz text source
    from file.

    nodex is the node to have its color edited.
    """
    # Full word, exact match.
    wordmatchpat = r'\b' + nodex + r'\b'
    pat = re.compile(wordmatchpat)
    # Empty string to hold full output of edited source.
    src2 = ''
    match = re.search(pat, src)
    # nodeidx = src.find(nodex)
    nodeidx = match.span()[0]
    print('nodeidx = ', nodeidx)
    src2 += src[:nodeidx]
    idxcolor = src[nodeidx:].find('fillcolor')
    print('idxcolor = ', idxcolor)
    # fillcolor="#b4d8e4"
    # 012345678901234567
    src2 += src[nodeidx:nodeidx + idxcolor + FILLCOLORSTRLEN]
    src2 += AQUAMARINE
    currentposit = nodeidx + idxcolor + FILLCOLORSTRLEN + COLORLEN
    src2 += src[currentposit:]
    return src2

def get_new_title(src, title):
    """
    Return new source string for graphviz
    with new title part of header.

    src is the original graphviz text source
    from file.

    title is a string.
    """
    # Empty string to hold full output of edited source.
    src2 = ''
    match = re.search(TITLEPAT, src)
    titleidx = match.span()[1]
    print('titleidx = ', titleidx)
    src2 += src[:titleidx]
    idxendtitle = src[titleidx:].find(ENDTITLEPAT)
    print('idxendtitle = ', idxendtitle)
    src2 += title
    currentposit = titleidx + idxendtitle
    print('currentposit = ', currentposit)
    src2 += src[currentposit:]
    return src2

def get_new_source_penwidth_nodes(src, nodex):
    """
    Return new source string for graphviz
    with selected node having bolded border.

    src is the original graphviz text source
    from file.

    nodex is the node to have its box bolded.
    """
    # Full word, exact match.
    wordmatchpat = r'\b' + nodex + r'\b'
    pat = re.compile(wordmatchpat)
    # Empty string to hold full output of edited source.
    src2 = ''
    match = re.search(pat, src)
    nodeidx = match.span()[0]
    print('nodeidx = ', nodeidx)
    src2 += src[:nodeidx]
    idxbracket = src[nodeidx:].find(']')
    src2 += src[nodeidx:nodeidx + idxbracket]
    print('idxbracket = ', idxbracket)
    src2 += BOLDED
    src2 += src[nodeidx + idxbracket:]
    return src2

def get_new_source_penwidth_edges(src, nodepair):
    """
    Return new source string for graphviz
    with selected node pair having bolded edge.

    src is the original graphviz text source
    from file.

    nodepair is the two node tuple to have
    its edge bolded.
    """
    # Full word, exact match.
    edgepat = EDGEPAT.format(*nodepair)
    print(edgepat)
    pat = re.compile(edgepat)
    # Empty string to hold full output of edited source.
    src2 = ''
    match = re.search(pat, src)
    edgeidx = match.span()[1]
    print('edgeidx = ', edgeidx)
    src2 += src[:edgeidx]
    src2 += BOLDEDEDGE 
    src2 += src[edgeidx:]
    return src2

def makehighlightedfuncgraphs():
    """
    Cycle through functionalities to make specific
    highlighted functional parts of the workflow
    output graphs.

    Returns dictionary of new filenames.
    """
    with open(INPUT, 'r') as f:
        src = f.read()

    retval = {}

    for functionality in TITLES:
        print(functionality)
        # Always start from the unedited source.
        src2 = get_new_title(src, TITLES[functionality])
        print('\nDoing node colors\n')
        for nodex in NODESTOCOLOR[functionality]:
            print(nodex)
            src2 = get_new_source_nodecolor(src2, nodex)
        print('\nDoing node bolding\n')
        for nodex in NODESTOCOLOR[functionality]:
            print(nodex)
            src2 = get_new_source_penwidth_nodes(src2, nodex)
        print('Bolding edges . . .')
        for nodepair in EDGENODESTOBOLD[functionality]:
            print(nodepair)
            src2 = get_new_source_penwidth_edges(src2, nodepair)
        print('Writing output files . . .')
        outputfile = OUTPUTFILES[functionality]
        with open(outputfile, 'w') as f:
            f.write(src2)
        # graphviz.render returns the path of the file it writes.
        retval[functionality] = {'dot':outputfile,
                                 'png':graphviz.render('dot', 'png', outputfile),
                                 'svg':graphviz.render('dot', 'svg', outputfile)}
    return retval

makehighlightedfuncgraphs()
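
One detail in the script worth a closer look is the whole-word match (the \b on either side of the node name) that replaced the commented out src.find(nodex) call. The sketch below uses a made-up two line dot fragment to show the difference; only the node names come from the real graph.

```python
import re

src = ('\t_parsed_data_inputs [fillcolor="#ffffff"]\n'
       '\tparsed_data [fillcolor="#b4d8e4"]\n')

# Plain substring search lands inside '_parsed_data_inputs'.
plain_idx = src.find('parsed_data')
# Whole-word search skips it; '_' counts as a word character,
# so there is no word boundary between '_' and 'p'.
word_idx = re.search(r'\bparsed_data\b', src).span()[0]

print(repr(src[plain_idx - 1:plain_idx + 18]))
print(repr(src[word_idx:word_idx + 11]))
```

With the substring search, the edit meant for parsed_data would have landed on the input node instead.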

Thanks for stopping by.




DAG Hamilton Workflow for Toy Text Processing Script
2024-07-04T15:04:00.000-07:00
Hello. It's been a minute.

I was fortunate to attend PyCon US in Pittsburgh earlier this year. DAGWorks had a booth on the expo floor where I discovered Hamilton. The project grabbed my attention as something that could help organize and present my code workflow better. My reaction could be compared to browsing Walmart while picking up a hardware item and seeing the perfect storage medium for your clothes or crafts at a bargain price, but even better, having someone there to explain the whole thing to you. The folks at the booth were really helpful.




Below I take on a contrived web scraping (it's crude) script in my domain (metals mining) and create a Hamilton workflow from it.

Pictured below is the Hamilton flow in the graphviz output format the project uses for flowcharts (graphviz has been around for decades - an oldie but goodie as it were).





I start with a csv file that has some really basic data on three big American metal mines (I did have to research the Wikipedia addresses - for instance, I originally looked for the Goldstrike Mine under the name "Post-Betze." It goes by several different names and encompasses several mines - more on that anon):

mine,state,commodity,wikipedia page,colloquial association
Red Dog,Alaska,zinc,https://en.wikipedia.org/wiki/Red_Dog_mine,Teck
Goldstrike,Nevada,gold,https://en.wikipedia.org/wiki/Goldstrike_mine,Nevada Gold Mines
Bingham Canyon,Utah,copper,https://en.wikipedia.org/wiki/Bingham_Canyon_Mine,Kennecott
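
For reference, a file like this can also be read with the standard library csv module instead of splitting lines on commas by hand. The sketch below is an alternative, not the approach the toy script takes; the function name parse_mines is made up, and the data is pasted in as a string so the example stands on its own.

```python
import csv
import io

CSVTEXT = """mine,state,commodity,wikipedia page,colloquial association
Red Dog,Alaska,zinc,https://en.wikipedia.org/wiki/Red_Dog_mine,Teck
Goldstrike,Nevada,gold,https://en.wikipedia.org/wiki/Goldstrike_mine,Nevada Gold Mines
Bingham Canyon,Utah,copper,https://en.wikipedia.org/wiki/Bingham_Canyon_Mine,Kennecott
"""

def parse_mines(fileobj):
    """
    Return dictionary keyed on mine name; each value is
    the full row as a dictionary keyed on column header.
    """
    return {row['mine']: dict(row) for row in csv.DictReader(fileobj)}

mines = parse_mines(io.StringIO(CSVTEXT))
print(mines['Goldstrike']['colloquial association'])
```

The csv module would also cope with quoted fields that contain commas, which a plain split(',') does not.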

Basically, I am going to attempt to scrape Wikipedia for information on who owns the three mines. Then I will try to use heuristics to gather information on what I think I know about them and gauge how up to date the Wikipedia information is.

Hamilton uses a system whereby you name your functions in a noun-like fashion ("def stuff()" instead of "def getstuff()") and feed those names as variables to the other functions in the workflow as parameters. This is what allows the tool to check your workflow for inconsistencies (types, for instance) and build the graphviz chart shown above.
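
To make the naming idea concrete, here is a rough sketch of how parameter names can be turned into a dependency graph using nothing but the standard library. To be clear, this is not how Hamilton is implemented and none of it is Hamilton's API; the helper names dependencies and execute are made up, and the three small functions are stand-ins.

```python
import inspect

def parsed_data(datafile):
    return {'source': datafile}

def data_with_wikipedia(parsed_data):
    return dict(parsed_data, wikipediatext='raw text')

def info_dict_merged(parsed_data, data_with_wikipedia):
    return dict(data_with_wikipedia, rows=len(parsed_data))

FUNCS = {f.__name__: f
         for f in (parsed_data, data_with_wikipedia, info_dict_merged)}

def dependencies(funcs):
    """
    Map each function name to the parameter names it takes.
    A parameter that is not itself a function name is an input.
    """
    return {name: list(inspect.signature(func).parameters)
            for name, func in funcs.items()}

def execute(target, funcs, inputs, cache=None):
    """
    Resolve target by recursively computing its parameters.
    Each node is computed once and kept in cache.
    """
    cache = {} if cache is None else cache
    if target in inputs:
        return inputs[target]
    if target not in cache:
        args = {p: execute(p, funcs, inputs, cache)
                for p in inspect.signature(funcs[target]).parameters}
        cache[target] = funcs[target](**args)
    return cache[target]

print(dependencies(FUNCS))
print(execute('info_dict_merged', FUNCS, {'datafile': 'data.csv'}))
```

The dependencies dictionary is essentially the edge list of the chart: every parameter name points at the function that uses it.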

You can use separate modules with functions and import them. I've done some of this on the bigger workflows I work with. Your Hamilton functions then end up being little one liners that call the bigger functions in the modules. This is necessary if you have functions you use repeatedly in your workflow that take different values at different stages. For this toy project, I've kept the whole thing self contained in one module toyscriptiii.py (yes, the iii in the filename represents my multiple failed attempts at web scraping and text processing - it's harder than it looks).

Below is the Hamilton main file run.py (I believe the "run.py" name is convention). I have done my best to preserve the dictionary return values as "faux immutable" through use of the copy module in each function. This helps me in debugging and examining output, much of which can be done from the run.py file (all the return values are stored in a dictionary). I've run a workflow of about 10 nodes on a dataset of about 600,000 rows. My computer has 32GB of RAM (Windows 11); it handled memory fine (less than half used). For really big data, keeping all these dictionaries in memory might be a problem.

# python 3.12

"""
Hamilton demo.
"""

import sys

import pprint

from hamilton import driver

import toyscriptiii as ts

dr = driver.Builder().with_modules(ts).build()

dr.display_all_functions("ts.png", deduplicate_inputs=True, keep_dot=True, orient='BR')

results = dr.execute(['parsed_data',
                      'data_with_wikipedia',
                      'data_with_company',
                      'info_output',
                      'commodity_word_counts',
                      'colloquial_company_word_counts',
                      'info_dict_merged',
                      'wikipedia_report'],
                      inputs={'datafile':'data.csv'})

pprint.pprint(results['info_dict_merged'])
print(results['info_output'])
print(results['wikipedia_report'])
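
The "faux immutable" approach mentioned above boils down to copying the incoming dictionary before changing anything. A minimal sketch, with a made-up function name data_with_flag and made-up data:

```python
import copy

def data_with_flag(parsed_data):
    """
    Return a new dictionary; leave parsed_data untouched.
    """
    # deepcopy also copies the nested per-mine dictionaries.
    retval = copy.deepcopy(parsed_data)
    for minex in retval:
        retval[minex]['checked'] = True
    return retval

parsed = {'Red Dog': {'state': 'Alaska'}}
flagged = data_with_flag(parsed)
print(parsed)
print(flagged)
```

A shallow copy (dict(parsed_data)) would not be enough here, because the inner dictionaries would still be shared between the two results.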

The main toy module with functions configured for the Hamilton graph:

# python 3.12

"""
Toy script.

Takes some input from a csv file on big American
mines and looks at Wikipedia text for some extra
context.
"""

import copy

import pprint

import sys

from urllib import request

import re

from bs4 import BeautifulSoup

def parsed_data(datafile:str) -> dict:
    """
    Get csv data into a dictionary keyed on mine name.
    """
    retval = {}
    with open(datafile, 'r') as f:
        headers = [x.strip() for x in next(f).split(',')]
        for linex in f:
            vals = [x.strip() for x in linex.split(',')]
            retval[vals[0]] = {key:val for key, val in zip(headers, vals)} 
    pprint.pprint(retval)
    return retval
        
def data_with_wikipedia(parsed_data:dict) -> dict:
    """
    Connect to wikipedia sites and fill in
    raw html data.

    Return dictionary.
    """
    retval = copy.deepcopy(parsed_data)
    for minex in retval:
        obj = request.urlopen(retval[minex]['wikipedia page'])
        html = obj.read()
        soup = BeautifulSoup(html, 'html.parser')
        print(soup.title)
        # Text from html and strip out newlines.
        newstring = soup.get_text().replace('\n', '')
        retval[minex]['wikipediatext'] = newstring
    return retval

def data_with_company(data_with_wikipedia:dict) -> dict:
    """
    Fetches company ownership for mine out of 
    Wikipedia text dump.

    Returns a new dictionary with the company name
    without the big wikipedia text dump.
    """
    # Wikipedia setup for mine company name.
    COMPANYPAT = r'[a-z]Company'
    # Lower case followed by upper case heuristic.
    ENDCOMPANYPAT = '[a-z][A-Z]'
    retval = copy.deepcopy(data_with_wikipedia)
    companypat = re.compile(COMPANYPAT)
    endcompanypat = re.compile(ENDCOMPANYPAT) 
    for minex in retval:
        print(minex)
        match = re.search(companypat, retval[minex]['wikipediatext'])
        if match:
            print('Company match span = ', match.span())
            companyidx = match.span()[1]
            match2 = re.search(endcompanypat, retval[minex]['wikipediatext'][companyidx:])
            print('End Company match span = ', match2.span())
            retval[minex]['company'] = retval[minex]['wikipediatext'][companyidx:companyidx + match2.span()[0] + 1]
        # Get rid of big text dump in return value.
        retval[minex].pop('wikipediatext')
    return retval

def info_output(data_with_company:dict) -> str:
    """
    Prints some output text to a file for each
    mine in the data_with_company dictionary.

    Returns string filename of output.
    """
    INFOLINEFMT = 'The {mine:s} mine is a big {commodity:s} mine in the State of {state:s} in the US.'
    COMPANYLINEFMT = '\n    {company:s} owns the mine.\n\n'
    retval = 'mine_info.txt'
    with open(retval, 'w') as f:
        for minex in data_with_company:
            print(INFOLINEFMT.format(**data_with_company[minex]), file=f)
            print(COMPANYLINEFMT.format(**data_with_company[minex]), file=f)
    return retval

def commodity_word_counts(data_with_wikipedia:dict, data_with_company:dict) -> dict:
    """
    Return dictionary keyed on mine with counts of
    commodity (e.g., zinc etc.) mentions on Wikipedia
    page (excluding ones in the company name).
    """
    retval = {}
    # This will probably miss some occurrences at mashed together
    # word boundaries. It is a rough estimate.
    # '\b[Gg]old\b'
    commoditypatfmt = r'\b[{0:s}{1:s}]{2:s}\b'
    for minex in data_with_wikipedia:
        print(minex)
        commodityuc = data_with_wikipedia[minex]['commodity'][0].upper()
        commoditypat = commoditypatfmt.format(commodityuc,
                                              data_with_wikipedia[minex]['commodity'][0],
                                              data_with_wikipedia[minex]['commodity'][1:])
        print(commoditypat)
        commoditymatches = re.findall(commoditypat, data_with_wikipedia[minex]['wikipediatext'])
        # pprint.pprint(commoditymatches)
        nummatchesraw = len(commoditymatches)
        print('Initial length of commoditymatches is {0:d}.'.format(nummatchesraw))
        # re.escape: treat the company name as literal text, not a pattern.
        companymatches = re.findall(re.escape(data_with_company[minex]['company']),
                                    data_with_wikipedia[minex]['wikipediatext'])
        numcompanymatches = len(companymatches)
        print('Length of companymatches is {0:d}.'.format(numcompanymatches))
        # Is the commodity name part of the company name?
        print('commoditypat = ', commoditypat)
        print(data_with_company[minex]['company'])
        commoditymatchcompany = re.search(commoditypat, data_with_company[minex]['company'])
        if commoditymatchcompany:
            print('commoditymatchcompany.span() = ', commoditymatchcompany.span())
            nummatchesfinal = nummatchesraw - numcompanymatches
            retval[minex] = nummatchesfinal 
        else:
            retval[minex] = nummatchesraw 
    return retval

def colloquial_company_word_counts(data_with_wikipedia:dict) -> dict:
    """
    Find the number of times the company you associate with
    the property/mine (very subjective) is within the
    text of the mine's wikipedia article.
    """
    retval = {}
    for minex in data_with_wikipedia:
        colloquial_pat = data_with_wikipedia[minex]['colloquial association']
        print(minex)
        # re.escape: treat the association as literal text, not a pattern.
        nummatches = len(re.findall(re.escape(colloquial_pat), data_with_wikipedia[minex]['wikipediatext']))
        print('{0:d} matches for colloquial association {1:s}.'.format(nummatches, colloquial_pat))
        retval[minex] = nummatches
    return retval

def info_dict_merged(data_with_company:dict,
                     commodity_word_counts:dict,
                     colloquial_company_word_counts:dict) -> dict:
    """
    Get a dictionary with all the collected information
    in it minus the big Wikipedia text dump.
    """
    retval = copy.deepcopy(data_with_company)
    for minex in retval:
        retval[minex]['colloquial association count'] = colloquial_company_word_counts[minex]
        retval[minex]['commodity word count'] = commodity_word_counts[minex]
    return retval

def wikipedia_report(info_dict_merged:dict) -> str:
    """
    Writes out Wikipedia information (word counts)
    to file in prose; returns string filename.
    """
    retval = 'wikipedia_info.txt'
    colloqfmt = 'The {0:s} mine has {1:d} occurrences of colloquial association {2:s} in its Wikipedia article text.\n'
    commodfmt = 'The {0:s} mine has {1:d} occurrences of commodity name {2:s} in its Wikipedia article text.\n\n'
    with open(retval, 'w') as f:
        for minex in info_dict_merged:
            print(colloqfmt.format(info_dict_merged[minex]['mine'],
                                   info_dict_merged[minex]['colloquial association count'],
                                   info_dict_merged[minex]['colloquial association']), file=f)
            print(commodfmt.format(info_dict_merged[minex]['mine'],
                                   info_dict_merged[minex]['commodity word count'],
                                   info_dict_merged[minex]['commodity']), file=f)
    return retval

My REGEX abilities are somewhere between "I've heard the term REGEX and know regular expressions exist" and bracketed characters in each slot brute force. It worked for this toy example. Each Wikipedia page features the word "Company" followed by the name of the owning corporate entity.
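
Here is the company name heuristic pulled out on its own, with a guard for the case where nothing matches. The two patterns are the ones from data_with_company; the function name company_from_text and the sample string are made up (the string is only a stand-in for what an infobox looks like once the newlines are stripped).

```python
import re

COMPANYPAT = re.compile(r'[a-z]Company')
# Lower case followed by upper case marks the end of the name.
ENDCOMPANYPAT = re.compile(r'[a-z][A-Z]')

def company_from_text(text):
    """
    Return the text between 'Company' and the next
    lower case to upper case transition, or None.
    """
    match = COMPANYPAT.search(text)
    if not match:
        return None
    start = match.span()[1]
    match2 = ENDCOMPANYPAT.search(text[start:])
    if not match2:
        return None
    # + 1 keeps the lower case letter that ends the name.
    return text[start:start + match2.span()[0] + 1]

# Made-up sample, not real Wikipedia text.
sample = 'ProductsGoldOwnerCompanyBarrick GoldWebsitebarrick.com'
print(company_from_text(sample))
print(company_from_text('no owner listed here'))
```

A name with an internal lower case to upper case transition would get cut short by this heuristic, which is one more reason to call it rough.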

Here are the two text outputs the script produces from the information provided (Wikipedia articles from July 2024):

The Red Dog mine is a big zinc mine in the State of Alaska in the US.

    NANA Regional Corporation owns the mine.


The Goldstrike mine is a big gold mine in the State of Nevada in the US.

    Barrick Gold owns the mine.


The Bingham Canyon mine is a big copper mine in the State of Utah in the US.

    Rio Tinto Group owns the mine.

The Red Dog mine has 21 occurrences of colloquial association Teck in its Wikipedia article text.

The Red Dog mine has 29 occurrences of commodity name zinc in its Wikipedia article text.


The Goldstrike mine has 0 occurrences of colloquial association Nevada Gold Mines in its Wikipedia article text.

The Goldstrike mine has 16 occurrences of commodity name gold in its Wikipedia article text.


The Bingham Canyon mine has 49 occurrences of colloquial association Kennecott in its Wikipedia article text.

The Bingham Canyon mine has 84 occurrences of commodity name copper in its Wikipedia article text.

Company names are relatively straightforward, although, mining company and property acquisitions and mergers being what they are, it can get complicated. I unwittingly chose three properties that Wikipedia reports as having one owner. Other big mines like Morenci, Arizona (copper) and Cortez, Nevada (gold) show more than one owner; that case is for another programming day. The Goldstrike information might be out of date: there is no mention of Nevada Gold Mines, and Newmont is mentioned once, but in a different context. The Cortez Wikipedia page is more current, although it still doesn't mention Nevada Gold Mines.

The inclusion of colloquial association in the input csv file was an afterthought based on a lot of the Wikipedia information not being completely in line with what I thought I knew. Teck is the operator of the Red Dog Mine in Alaska. That name does get mentioned frequently in the Wikipedia article.

Enough mining stuff - it is a programming blog after all. Next time (not written yet) I hope to cover dressing up and highlighting the graphviz output a bit.

Thank you for stopping by.






Embedding an Image in an Outlook Email
2021-08-14T09:37:00.001-07:00
I had a project where I needed to generate some draft emails programmatically in Outlook.

Inserting the company logo and some content related images took some googling to sort through. Ideally I wanted to encode the images as Base64 strings, but Outlook does not allow this.

The code below has some html I took from an existing email. Interpolating strings into html and, worse, hand editing it, is probably not best practice, but for purposes of this demo, it works. Also, there may be more abstracted tools and libraries for working with Outlook. I'm used to using win32com, so that is my general go-to tool for Microsoft Office and other historically significant desktop Windows apps.

Screenshot of draft email that script generates:

Code:

"""
Demo of how to embed a picture in a Microsoft Outlook email.
"""

import win32com.client as win32

PR_ATTACH_CONTENT_ID = 'http://schemas.microsoft.com/mapi/proptag/0x3712001F'
PR_ATTACHMENT_HIDDEN = 'http://schemas.microsoft.com/mapi/proptag/0x7FFE000B'

PICLOC = r'C:\Users\carl.trachte\Documents\paintbrush.png'

BODYFORMAT = """
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
   <head>
      <meta http-equiv=Content-Type content="text/html; charset=us-ascii">
      <meta name=Generator content="Microsoft Word 15 (filtered medium)">
      <!--[if !mso]>
      <style>v\:* {{behavior:url(#default#VML);}}
         o\:* {{behavior:url(#default#VML);}}
         w\:* {{behavior:url(#default#VML);}}
         .shape {{behavior:url(#default#VML);}}
      </style>
      <![endif]-->
      <style>
         <!--
            /* Font Definitions */
            @font-face
            	{{font-family:"Cambria Math";
            	panose-1:2 4 5 3 5 4 6 3 2 4;}}
            @font-face
            	{{font-family:Calibri;
            	panose-1:2 15 5 2 2 2 4 3 2 4;}}
            /* Style Definitions */
            p.MsoNormal, li.MsoNormal, div.MsoNormal
            	{{margin:0in;
            	font-size:11.0pt;
            	font-family:"Calibri",sans-serif;}}
            span.EmailStyle17
            	{{mso-style-type:personal-compose;
            	font-family:"Calibri",sans-serif;
            	color:windowtext;}}
            .MsoChpDefault
            	{{mso-style-type:export-only;
            	font-family:"Calibri",sans-serif;}}
            @page WordSection1
            	{{size:8.5in 11.0in;
            	margin:1.0in 1.0in 1.0in 1.0in;}}
            div.WordSection1
            	{{page:WordSection1;}}
            -->
      </style>
      <!--[if gte mso 9]>
      <xml>
         <o:shapedefaults v:ext="edit" spidmax="1026" />
      </xml>
      <![endif]--><!--[if gte mso 9]>
      <xml>
         <o:shapelayout v:ext="edit">
            <o:idmap v:ext="edit" data="1" />
         </o:shapelayout>
      </xml>
      <![endif]-->
   </head>
   <body lang=EN-US link="#0563C1" vlink="#954F72" style='word-wrap:break-word'>
      <div class=WordSection1>
         <table class=MsoNormalTable border=0 cellspacing=0 cellpadding=0 style='margin-left:-1.5pt;border-collapse:collapse'>
            <tr style='height:14.5pt'>
            </tr>
         </table>
         {0:s}
         <p class=MsoNormal>
            <o:p>&nbsp;</o:p>
         </p>
         <p class=MsoNormal>
            <o:p>&nbsp;</o:p>
         </p>
      </div>
      </div>
   </body>
</html>"""

GRAPHICFRAME = """
      <div class=WordSection1>
         <p class=MsoNormal>
            <o:p>&nbsp;</o:p>
         </p>
         <p class=MsoNormal>
            <o:p>&nbsp;</o:p>
         </p>
         <p class=MsoNormal>
            <b>
               <o:p>&nbsp;</o:p>
            </b>
         </p>
         <p class=MsoNormal>
            <o:p>&nbsp;</o:p>
         </p>
         <p class=MsoNormal>
            <img width=410 height=410 style='width:4.2666in;height:4.2666in' id="Picture_x0020_2" src="cid:{0:s}" alt="Chart&#10;&#10;Description automatically generated">
            <o:p></o:p>
         </p>
         <p class=MsoNormal>
            <o:p>&nbsp;</o:p>
         </p>
         <p class=MsoNormal>
            <o:p>&nbsp;</o:p>
         </p>
"""


def getoutlook():
    """
    Return Outlook object.
    """
    return win32.gencache.EnsureDispatch('outlook.application')

def makeemail(outlookobject, text, subject, recipient):
    """
    Return e-mail object
    """
    mail = outlookobject.CreateItem(0)
    mail.To = recipient
    mail.Subject = subject
    mail.HTMLBody = text
    return mail

def addlogoshow(mailobject):
    """
    Embed cid image in e-mail.

    Save e-mail and bring up in window.
    """
    attachmnt = mailobject.Attachments.Add(PICLOC, win32.constants.olByValue, 0, 'paintbrush.png')
    attachmnt.PropertyAccessor.SetProperty(PR_ATTACH_CONTENT_ID, 'paintbrush.png')
    attachmnt.PropertyAccessor.SetProperty(PR_ATTACHMENT_HIDDEN, False)
    mailobject.Save()
    mailobject.Display()
    mailobject.Save()


outlook = getoutlook()
htmlbody = BODYFORMAT.format(GRAPHICFRAME.format('paintbrush.png'))
mail = makeemail(outlook, htmlbody, 'blah', 'XXXXXXXX@gmail.com')
addlogoshow(mail)
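
The link between the attachment and the html is the content ID: the string stored in PR_ATTACH_CONTENT_ID has to match what follows cid: in the img tag's src. Below is a minimal sketch of keeping those two in step without touching Outlook at all. The helper names are hypothetical, and ntpath is used so the Windows path splits the same way on any platform.

```python
import html
import ntpath

def content_id_for(path):
    """Use the bare file name as the content ID, as the demo does."""
    return ntpath.basename(path)

def cid_img_tag(content_id, width, height):
    """Build an img tag whose src points at the attachment's content ID."""
    return '<img width={0:d} height={1:d} src="cid:{2:s}">'.format(
        width, height, html.escape(content_id, quote=True))

cid = content_id_for(r'C:\Users\carl.trachte\Documents\paintbrush.png')
tag = cid_img_tag(cid, 410, 410)
print(tag)
```

Deriving both the property value and the src from one function call is one way to avoid typing the same file name in three places.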





Powershell Encoded Command, sqlcmd, and csv Query Output
2017-12-09T18:18:00.000-08:00
A while back I did a post on using sqlcmd and dumping data to Excel.  At the time I was using Microsoft SQL Server's bcp (bulk copy) utility to dump data to a csv file.



Use of bcp is blocked where I am working now.  But Powershell and sqlcmd are very much available on the Windows workstations we use.  Just as with bcp, smithing text for sqlcmd input can be a little tricky, same with Powershell.  But Powershell has an EncodedCommand feature which allows you to feed input to it as a base 64 string.  This will be a quick demo of the use of this feature and output of a faux comma delimited (csv) file with data.



Disclaimer:  scripts that rely extensively on os.system() calls from Python are indeed hacky and mousetrappy.  I think the saying goes "Necessity is a mother," or something similar.  Onward.



Getting the base 64 string from the original string:





First our SQL code that queries a mock table I made in my mock database:




USE test;




SELECT testpk,
       namex,
    [value]
FROM testtable
ORDER BY testpk;




We will call this file selectdata.sql.




Then the call to sqlcmd/Powershell:




sqlcmd -S localhost -i .\selectdata.sql -E -h -1 -s "," -W  | Tee-Object -FilePath .\testoutput




In Python (we have to use Python 2.7 in our environment, so this is Python 2.x specific):




Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import base64
>>> stringx = r'sqlcmd -S localhost -i .\selectdata.sql -E -h -1 -s "," -W | Tee-Object -FilePath .\testoutput'
>>> bytesx = stringx.encode('utf-16-le')
>>> encodedcommandx = base64.b64encode(bytesx)
>>> encodedcommandx
'cwBxAGwAYwBtAGQAIAAtAFMAIABsAG8AYwBhAGwAaABvAHMAdAAgAC0AaQAgAC4AXABzAGUAbABlAGMAdABkAGEAdABhAC4AcwBxAGwAIAAtAEUAIAAtAGgAIAAtADEAIAAtAHMAIAAiACwAIgAgAC0AVwAgAHwAIABUAGUAZQAtAE8AYgBqAGUAYwB0ACAALQBGAGkAbABlAFAAYQB0AGgAIAAuAFwAdABlAHMAdABvAHUAdABwAHUAdAA='
>>>








I had to type out my command in the Python interpreter.  When I pasted it in from GVim, it choked on the UTF encoding.





Now, Powershell:

PS C:\Users\ctrachte> $sqlcmdstring = 'sqlcmd -S localhost -i .\selectdata.sql -E -h -1 -s "," -W | Tee-Object -FilePath
 .\testoutput'
PS C:\Users\ctrachte> $encodedcommand = [Convert]::ToBase64String([Text.Encoding]::Unicode.GetBytes($sqlcmdstring))
PS C:\Users\ctrachte> $encodedcommand
cwBxAGwAYwBtAGQAIAAtAFMAIABsAG8AYwBhAGwAaABvAHMAdAAgAC0AaQAgAC4AXABzAGUAbABlAGMAdABkAGEAdABhAC4AcwBxAGwAIAAtAEUAIAAtAGgAIAAtADEAIAAtAHMAIAAiACwAIgAgAC0AVwAgAHwAIABUAGUAZQAtAE8AYgBqAGUAYwB0ACAALQBGAGkAbABlAFAAYQB0AGgAIAAuAFwAdABlAHMAdABvAHUAdABwAHUAdAA=
PS C:\Users\ctrachte>




OK, the two base 64 strings are the same, so we are good.
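
For reference, a rough Python 3 version of the same encoding step (the session above is Python 2.7). In Python 3, b64encode returns bytes, so the result is decoded to a plain string before it goes on a command line. The function names are mine, not part of any library.

```python
import base64

def encode_for_powershell(command):
    """Base 64 encode a command the way -EncodedCommand expects (UTF-16 LE)."""
    return base64.b64encode(command.encode('utf-16-le')).decode('ascii')

def decode_from_powershell(encoded):
    """Reverse the encoding; handy for checking what a string will run."""
    return base64.b64decode(encoded).decode('utf-16-le')

COMMAND = (r'sqlcmd -S localhost -i .\selectdata.sql -E -h -1 -s "," -W '
           r'| Tee-Object -FilePath .\testoutput')

ENCODED = encode_for_powershell(COMMAND)
print(ENCODED)
print(decode_from_powershell(ENCODED) == COMMAND)
```

In Python 3 the call could also go through subprocess.run(['powershell', '-EncodedCommand', ENCODED]) instead of os.system(), which avoids building a command string by hand.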




Command Execution from os.system() call:




>>> import os

>>> INVOKEPOWERSHELL = 'Powershell -EncodedCommand {0:s}'

>>> os.system(INVOKEPOWERSHELL.format(encodedcommandx))
Changed database context to 'test'.
000001,VOLUME,11.0
000002,YEAR,1999.0

(2 rows affected)
0
>>>




And, thanks to Powershell's version of the tee command from UNIX-like systems, we have a faux csv file as well as output to the command line.
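
The file is "faux" csv partly because the sqlcmd messages land in it along with the data. Below is a minimal sketch of separating the two, assuming every real record has a known number of fields; the sample text is copied from the console output above, and the function name is hypothetical. Reading the actual file from disk is left out because its encoding depends on how Powershell wrote it.

```python
import csv

SAMPLE_OUTPUT = """Changed database context to 'test'.
000001,VOLUME,11.0
000002,YEAR,1999.0

(2 rows affected)
"""

def data_rows(lines, ncolumns):
    """Keep only the lines that split into the expected number of fields."""
    return [row for row in csv.reader(lines) if len(row) == ncolumns]

rows = data_rows(SAMPLE_OUTPUT.splitlines(), 3)
for row in rows:
    print(row)
```

Adding SET NOCOUNT ON to the top of the SQL file is another way to get rid of the "rows affected" line at the source.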




Stackoverflow gave me much of what I needed to know for this:




Links:




Powershell's encoded command:




https://blogs.technet.microsoft.com/heyscriptingguy/2015/10/27/powertip-encode-string-and-execute-with-powershell/




sqlcmd's output to faux csv:




https://stackoverflow.com/questions/425379/how-to-export-data-as-csv-format-from-sql-server-using-sqlcmd




The UTF encoding stuff just took some trial and error and fiddling.




Thanks for stopping by.



















Filling in Missing Grouping Columns of MSSQL SSRS Report Dumped to Excel
2017-02-19T16:34:00.000-08:00


This is another simple but common problem in certain business environments:



1) Data are presented via a Microsoft SQL Server Reporting Services report, BUT



2) The user wants the data in Excel, and, further, wants to play with it (pivot, etc.) there.  The problem is that the grouping column labels are not in every record, only in the one row that begins the list of records for that group (sanitized screenshot below):





But I don't WANT to copy and paste all those groupings for 30,000 records :*-(

I had this assignment recently from a remote request.  It took about four rounds of an e-mail exchange to figure out that it really wasn't a data problem, but a formatting one that needed solving.



It is possible to do the whole thing in Python.  I did the Excel part by hand in order to get a handle on the data:



1) In Excel, delete the extra rows on top of the report leaving just the headers and the data.



2) In Excel, select everything on the data page and format the cells correctly by unselecting the Merge Cells and Wrap Text options.



3) In Excel, at this point you should be able to see if there are extra empty columns as space fillers; delete them.  Save the worksheet as a csv file.



4) In a text editor, open your csv file, identify any empty rows, and delete them.  Change column header names as desired.



Now the Python part:



#!python36

"""
Doctor csv dump from unmerged cell
dump of SSRS dump from MSSQL database.

Fill in cell gaps where merged
cells had only one grouping value
so that all rows are complete records.
"""

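COMMA = ','
EMPTY = ''

INFILE = 'rawdata.csv'
OUTFILE = 'canneddumpfixed.csv'

ERRORFLAG = 'ERROR!'

with open(INFILE, 'r') as f, open(OUTFILE, 'w') as f2:
    headerline = next(f)
    numbercolumns = len(headerline.split(COMMA))
    # Carry the header row over to the output file.
    f2.write(headerline)

    # Assume at least one data column on far right.
    # ERRORFLAG marks any grouping cell that is empty before
    # a value has been seen for that column.
    missingvalues = (numbercolumns - 1) * [ERRORFLAG]

    for linex in f:
        print('Processing line {:s} . . .'.format(linex.rstrip()))
        splitrecord = linex.split(COMMA)
        # Last column is data; everything before it is a grouping column.
        for slotx in range(0, numbercolumns - 1):
            if splitrecord[slotx] != EMPTY:
                # Remember the most recent label for this grouping column.
                missingvalues[slotx] = splitrecord[slotx]
            else:
                # Fill the gap left by the merged cell.
                splitrecord[slotx] = missingvalues[slotx]
        f2.write(COMMA.join(splitrecord))

print('Finished')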



At this point you've got your data in csv format - you can open it in Excel and go to work.
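
To make the fill-down idea concrete, here is a small self-contained sketch that runs on made-up in-memory data instead of a file. The sample rows and the function name are hypothetical, and it uses the csv module for the splitting, which is a swap from the plain split(',') in the script.

```python
import csv
import io

def fill_down(rows, ngroupcols):
    """Yield rows with empty grouping cells filled from the last value seen."""
    lastseen = [''] * ngroupcols
    for row in rows:
        row = list(row)
        for slot in range(ngroupcols):
            if row[slot] != '':
                lastseen[slot] = row[slot]
            else:
                row[slot] = lastseen[slot]
        yield row

# Made-up report: two grouping columns, one data column.
SAMPLE = 'region,site,value\nWest,A,1\n,,2\n,B,3\nEast,C,4\n'

reader = csv.reader(io.StringIO(SAMPLE))
header = next(reader)
filled = list(fill_down(reader, len(header) - 1))
for row in filled:
    print(row)
```

The csv module earns its keep if any of the grouping labels contain commas inside quotes; a plain split would cut those labels in two.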



There may be a free or COTS (commercial off the shelf) utility that does all this somewhere in the Microsoft "ecosystem" (I think that's their fancy enviro-friendly word for vendor-user community) but I don't know of one.





Thanks for stopping by. 













Crude Testing of Equivalent Code With assert
2017-02-15T19:40:00.000-08:00
In engineering and business environments, it is common to have to



1) recreate an equivalent calculation in a different format for a different purpose and check the results against the original calculation.





2) shepherd a calculation process from one vendor system through a transition to another (an upgrade, for example) by hacking a set of provisional scripts together.





3) implement a bunch of linear regressions in calculations.  If I recall correctly, there has been a linear regression functionality in Excel for ages (since the early 90's?); it is the tried and (maybe) true tool of data fitters/forcers everywhere.  Conceivably you could accurately, if not precisely, model just about any curve with enough linear segments.  Mercifully, the ones I show below have only two segments per data set.
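
As a quick illustration of the "enough linear segments" remark in the third item, here is a rough sketch that approximates a curve with straight segments and shows the worst-case error shrinking as segments are added. This is plain piecewise linear interpolation between evenly spaced points, not a regression fit, and the function names and the choice of math.sqrt as the curve are mine.

```python
import math

def piecewise_linear(func, lo, hi, segments):
    """Return a function approximating func on [lo, hi] with straight segments."""
    step = (hi - lo) / segments
    knots = [lo + i * step for i in range(segments + 1)]
    values = [func(k) for k in knots]

    def approx(x):
        # Clamp the segment index so x == hi lands in the last segment.
        idx = min(int((x - lo) / step), segments - 1)
        m = (values[idx + 1] - values[idx]) / step
        return values[idx] + m * (x - knots[idx])
    return approx

def max_error(func, approx, lo, hi, samples=1000):
    """Largest absolute gap between func and approx over evenly spaced x."""
    xs = [lo + (hi - lo) * i / samples for i in range(samples + 1)]
    return max(abs(func(x) - approx(x)) for x in xs)

two = piecewise_linear(math.sqrt, 1.0, 4.0, 2)
twenty = piecewise_linear(math.sqrt, 1.0, 4.0, 20)
print(max_error(math.sqrt, two, 1.0, 4.0))
print(max_error(math.sqrt, twenty, 1.0, 4.0))
```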



This problem embodies all three bullets above.  I've sanitized the code which makes it a little ridiculous, but no less voluminous (sorry).



Here's what we have in the vendor's system - it is Python (2.7) code, but it's run inside special a la carte purchased software that my department doesn't have.  Also, it's full of a bunch of constants that I'm not really comfortable recognizing or maintaining:



"""
Cut and pasted formulas from vendor
specific GUI/Python API.
"""

# LOC1
def loc1fromvendor(CONTROL1,
                   CONTROL2,
                   x):
    """
    Loc1 y calculation from vendor.

    CONTROL1 is the primary code (integer
    or round digit float).
    CONTROL2 is the secondary code (integer
    or round digit float).
    x is the x-axis input.  (float).

    Returns float.
    """
    DEFAULTY = 2.50
    
    if CONTROL1 == 9:
            if CONTROL2 == 1:
                if x > 1.275:
                    Y = (-0.0003 * x) + 6.4781
                else:
                    Y = 2.53
            else:
                Y = 2.54
    elif CONTROL1 == 8:
            Y = 2.6
    elif CONTROL1 == 7:
            if CONTROL2 == 1:
                if x > 1.315:
                    Y = -0.003 * x + 6.548
                else:
                    Y = 2.6
            else:
                Y = -0.0031 * x + 2.958
    elif CONTROL1 == 6:
            if CONTROL2 == 1:
                if x >1.310:
                     Y = -0.0018 * x + 4.9307
                else:
                    Y = 2.57
            else:
                Y = -0.0004 * x + 3.0612
    elif CONTROL1 == 5:
            if CONTROL2 == 1:
                if x >1.250:
                    Y = -0.0026 * x + 5.7152
                else:
                    Y = 2.47
            else:
                Y = -0.0003 * x + 2.8733
    elif CONTROL1 == 4:
            if CONTROL2 == 1:
                if x >1.290:
                    Y = -0.0032 * x + 6.7257
                else:
                    Y = 2.6
            else:
                Y = -0.0002 * x + 2.8215
    elif CONTROL1 == 1:
            if CONTROL2 == 1:
                Y = 2.35
            else:
                Y = 2.45
    else:
            Y = DEFAULTY
    return Y

# LOC2
def loc2fromvendor(CONTROL1,
                   CONTROL2,
                   x):
    """
    Loc2 y calculation from vendor.

    CONTROL1 is the primary code (integer
    or round digit float).
    CONTROL2 is the secondary code (integer
    or round digit float).
    x is the x-axis input.  (float).

    Returns float.
    """
    DEFAULTY = 2.50
    
    if CONTROL1 == 9:
        if CONTROL2 == 1:
                Y = -0.0006 * x + 3.3121
        else:
                Y = -0.0006 * x + 3.3121
    elif CONTROL1 == 8:
            if CONTROL2 == 1:
                if x >1.050:
                    Y = 2.65
                else:
                    Y = 2.65
            else:
                if x >1.050:
                    Y = 2.65
                else:
                    Y = 2.65
    elif CONTROL1 == 7:
            if CONTROL2 == 1:
                if x > 1.050:
                    Y = -0.0012 * x + 3.886
                else:
                    Y = -0.0012 * x + 3.886
            else:
                if x > 1.050:
                    Y = -0.00007 * x + 2.6787
                else:
                    Y = -0.00007 * x + 2.6787
    elif CONTROL1 == 6:
            if CONTROL2 == 1:
                if x >1.050:
                    Y = -0.001 * x + 3.731
                else:
                    Y = -0.001 * x + 3.731
            else:
                if x >1.050:
                    Y = -0.0012 * x + 4.0757
                else:
                    Y = -0.0012 * x + 4.0757
    elif CONTROL1 == 5:
            if CONTROL2 == 1:
                if x >1.050:
                    Y = 2.1
                else:
                    Y = 2.1
            else:
                if x >1.050:
                    Y = -0.0003 * x + 2.9564
                else:
                    Y = -0.0003 * x + 2.9564
    elif CONTROL1 == 4:
            if CONTROL2 == 1:
                if x >1.050:
                    Y = -0.000009 * x + 2.1972
                else:
                    Y = -0.000009 *x + 2.1972
            else:
                if x >1.050:
                    Y = -0.0005 * x + 3.2461
                else:
                    Y = -0.0005 * x + 3.2461                
    elif CONTROL1 == 1:
            if CONTROL2 == 1:
                Y = -0.001 * x + 3.7257
            else:
                Y = -0.001 * x + 3.7257
    else:
            Y = DEFAULTY
    return Y

# LOC3
def loc3fromvendor(CONTROL1,
                   CONTROL2,
                   x):
    """
    Loc3 y calculation from vendor.

    CONTROL1 is the primary code (integer
    or round digit float).
    CONTROL2 is the secondary code (integer
    or round digit float).
    x is the x-axis input.  (float).

    Returns float.
    """
    DEFAULTY = 2.50
    
    if CONTROL1 == 9:
            Y = 2.49
    elif CONTROL1 == 8:
            if x > 1.000:
                Y = -0.0006 * x + 3.3291
            else:
                Y = 2.64
    elif CONTROL1 == 7:
            if x > 1.050:
                Y = -0.0009 * x + 3.5929
            else:
                Y = 2.67
    elif CONTROL1 == 6:
            if x > 1.080:
                Y = -0.0013 * x + 4.0665
            else:
                # Debug.
                # print 'x in vendor function = {:f}'.format(x)
                Y = 2.65
    elif CONTROL1 == 5:
            if x > 950:
                Y = -0.001 * x + 3.4996
            else:
                Y = 2.59
    elif CONTROL1 == 4:
            if x > 1.100:
                Y = -0.0018 * x + 4.6690
            else:
                Y = 2.68
    elif CONTROL1 == 1:
            if x > 1.000:
                Y = -0.0004 * x + 2.8857
            else:
                Y = 2.49
    else:
            Y = DEFAULTY
    return Y

# LOC4
def loc4fromvendor(CONTROL1,
                   CONTROL2,
                   x):
    """
    Loc4 y calculation from vendor.

    CONTROL1 is the primary code (integer
    or round digit float).
    CONTROL2 is the secondary code (integer
    or round digit float).
    x is the x-axis input.  (float).

    Returns float.
    """
    DEFAULTY = 2.50
    
    if CONTROL1 == 9:
        Y = -0.0000008 * x + 2.6761
    elif CONTROL1 == 8:
            Y = -0.000003 * x + 2.6975
    elif CONTROL1 == 7:
            if CONTROL2 == 1:
                if x > 1.000:
                    Y = -0.0018 * x + 4.3902
                else:
                    Y = 2.60
            else:
                Y = -0.00009 * x + 2.7334
    elif CONTROL1 == 6:
            if CONTROL2 == 1:
                if x > 1.100:
                     Y = -0.0013 * x + 4.0322
                else:
                    Y = 2.58
            else:
                Y = -0.0002 * x + 2.8081
    elif CONTROL1 == 5:
            if CONTROL2 == 1:
                Y = -0.0018 * x + 4.2758
            else:
                Y = -0.0001 * x + 2.6535
    elif CONTROL1 == 4:
            if CONTROL2 == 1:
                if x > 1.000:
                    Y = -0.002 * x + 4.5548
                else:
                    Y = 2.60
            else:
                if x > 1125:
                    Y = -0.0011 * x + 3.9184
                else:
                    Y = 2.65
    elif CONTROL1 == 1:
            Y = -0.0003 * x + 2.7802
    else:
            Y = DEFAULTY
    return Y





My code relies less on multiple functions and more on a single function with a bunch of lookup dictionaries rolled into one big dictionary.  I'm not arguing my approach is necessarily better.  For instance, I implemented my x variable ranges with lower bounds based on the precision of my data.  This isn't very portable.



The need to lock down my results to keep them in line with the original led me to the assert statement and to writing a little walk of my dictionary against my function and the vendor's.  This way, when I get a new "vendor function" (actually a snippet of code for a particular location or area), I can paste it into this crude ersatz test suite and see what needs changing.
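
Stripped down to one branch, the idea looks something like the sketch below. The toy function borrows the CONTROL1 == 9, CONTROL2 == 1 branch of loc1fromvendor; the lookup layout and all of the names are simplified stand-ins for illustration, not the structures used in the full script.

```python
import math

def toy_vendor(control1, x):
    """Toy stand-in for a pasted vendor snippet."""
    if control1 == 9:
        if x > 1.275:
            return -0.0003 * x + 6.4781
        return 2.53
    return 2.50

# Hypothetical lookup: control1 -> {(low x, high x): (slope, intercept)}.
# 99 plays the part of the undefined/default code.
TOY_LOOKUP = {9: {(0.0, 1.27500): (0.0, 2.53),
                  (1.27501, 5000.0): (-0.0003, 6.4781)},
              99: {(0.0, 5000.0): (0.0, 2.50)}}

def toy_lookup_y(control1, x):
    """Compute y = mx + b from whichever x range contains x."""
    ranges = TOY_LOOKUP.get(control1, TOY_LOOKUP[99])
    for (low, high), (m, b) in ranges.items():
        if low <= x <= high:
            return m * x + b
    raise ValueError('x = {0:f} falls outside every range'.format(x))

def check_equivalence():
    """Walk the dictionary, asserting both versions agree at each range end."""
    checked = 0
    for control1, ranges in TOY_LOOKUP.items():
        for (low, high) in ranges:
            for x in (low, high):
                expected = toy_vendor(control1, x)
                actual = toy_lookup_y(control1, x)
                assert math.isclose(actual, expected), (control1, x, actual, expected)
                checked += 1
    return checked

print('{0:d} points checked'.format(check_equivalence()))
```

Testing at the ends of each range is deliberate: the boundaries are where a transposed digit or a misplaced decimal point shows up first.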



I caught a few missed decimal places, typos, transposed digits, and plain old omissions in my code using this approach.  It is possible I've gone overboard with constants.  I don't care.  I have to read them and the only way I can keep them straight is by lining up the decimal places and locking them down as named constants (programmatically they are variables, but I'm not changing them).



"But why don't they and why don't you use scientific notation?"



As we used to say in the Navy years ago, "There is the right way, there is the wrong way, and the Navy way."  Guess which one the vendor uses?  Onward.



Here's my code with the "test" of equivalency for the two approaches:



"""
Attempt at generic script to process linear regressions
for multiple areas.
"""

import sys

import vendorformulas as vfx

# Loc abbreviations.
LOC1 = 'loc1'
LOC2 = 'loc2'
LOC3 = 'loc3'
LOC4 = 'loc4'

DEFAULTY = 2.50 

CTL2ONE = 1
CTL2TWO = 2

BIGX = 5000.0
LITTLEX = 0.0

TYPE9 = 9
TYPE8 = 8
TYPE7 = 7
TYPE6 = 6
TYPE5 = 5
TYPE4 = 4
TYPE1 = 1
# Undefined control1 type for default for each loc.
UNDEF = 99

slope = 'm'
b = 'b'

# Compute y using formula (y = mx + b), control1, x, control2
# nested dictionaries
#     control2
#         x range
#             m
#             b
# Original logic gives unassigned CONTROL2 block to CTL2TWO interpretation
# Honor this in logic in program.

# Slope values.
NOSLOPE = 0.0

NEG0032000 = -0.0032000
NEG0031000 = -0.0031000
NEG0030000 = -0.0030000
NEG0026000 = -0.0026000
NEG0020000 = -0.0020000
NEG0018000 = -0.0018000
NEG0013000 = -0.0013000
NEG0012000 = -0.0012000
NEG0011000 = -0.0011000
NEG0010000 = -0.0010000
NEG0009000 = -0.0009000
NEG0006000 = -0.0006000
NEG0005000 = -0.0005000
NEG0004000 = -0.0004000
NEG0003000 = -0.0003000
NEG0002000 = -0.0002000
NEG0001000 = -0.0001000
NEG0000900 = -0.0000900
NEG0000700 = -0.0000700
NEG0000090 = -0.0000090
NEG0000030 = -0.0000030
NEG0000008 = -0.0000008

# Intercept values.
T2PT1000 = 2.1000
T2PT1972 = 2.1972
T2PT3500 = 2.3500
T2PT4500 = 2.4500
T2PT4700 = 2.4700
T2PT4900 = 2.4900
T2PT5300 = 2.5300
T2PT5400 = 2.5400
T2PT5700 = 2.5700
T2PT5800 = 2.5800
T2PT5900 = 2.5900
T2PT6000 = 2.6000
T2PT6400 = 2.6400
T2PT6500 = 2.6500
T2PT6535 = 2.6535
T2PT6700 = 2.6700
T2PT6761 = 2.6761
T2PT6787 = 2.6787
T2PT6800 = 2.6800
T2PT6975 = 2.6975
T2PT7334 = 2.7334
T2PT7802 = 2.7802
T2PT8081 = 2.8081
T2PT8215 = 2.8215
T2PT8733 = 2.8733
T2PT8857 = 2.8857
T2PT9564 = 2.9564
T2PT9580 = 2.9580
T3PT0612 = 3.0612
T3PT2461 = 3.2461
T3PT3121 = 3.3121
T3PT3291 = 3.3291
T3PT4996 = 3.4996
T3PT5929 = 3.5929
T3PT7257 = 3.7257
T3PT7310 = 3.7310
T3PT8860 = 3.8860
T3PT9184 = 3.9184
F4PT0322 = 4.0322
F4PT0665 = 4.0665
F4PT0757 = 4.0757
F4PT2758 = 4.2758
F4PT3902 = 4.3902
F4PT5548 = 4.5548
F4PT6690 = 4.6690
F4PT9307 = 4.9307
F5PT7152 = 5.7152
S6PT4781 = 6.4781
S6PT5480 = 6.5480
S6PT7257 = 6.7257

LOC1YS = {TYPE9:
             {CTL2ONE:
                 {(LITTLEX, 1.27500):
                     {slope:NOSLOPE, b:T2PT5300},
                  (1.27501, BIGX):
                     {slope:NEG0003000, b:S6PT4781}},
              CTL2TWO:
                 {(LITTLEX, BIGX):
                     {slope:NOSLOPE, b:T2PT5400}}},
          TYPE8:
             {CTL2ONE:
                 {(LITTLEX, BIGX):
                     {slope:NOSLOPE, b:T2PT6000}},
              CTL2TWO:
                 {(LITTLEX, BIGX):
                     {slope:NOSLOPE, b:T2PT6000}}},
          TYPE7:
             {CTL2ONE:
                 {(LITTLEX, 1.31500):
                     {slope:NOSLOPE, b:T2PT6000},
                  (1.31501, BIGX):
                     {slope:NEG0030000, b:S6PT5480}},
              CTL2TWO:
                 {(LITTLEX, BIGX):
                     {slope:NEG0031000, b:T2PT9580}}},
          TYPE6:
             {CTL2ONE:
                 {(LITTLEX, 1.31000):
                     {slope:NOSLOPE, b:T2PT5700},
                  (1.31001, BIGX):
                     {slope:NEG0018000, b:F4PT9307}},
              CTL2TWO:
                 {(LITTLEX, BIGX):
                     {slope:NEG0004000, b:T3PT0612}}},
          TYPE5:
             {CTL2ONE:
                 {(LITTLEX, 1.25000):
                     {slope:NOSLOPE, b:T2PT4700},
                  (1.25001, BIGX):
                     {slope:NEG0026000, b:F5PT7152}},
              CTL2TWO:
                 {(LITTLEX, BIGX):
                     {slope:NEG0003000, b:T2PT8733}}},
          TYPE4:
             {CTL2ONE:
                 {(LITTLEX, 1.29000):
                     {slope:NOSLOPE, b:T2PT6000},
                  (1.29001, BIGX):
                     {slope:NEG0032000, b:S6PT7257}},
              CTL2TWO:
                 {(LITTLEX, BIGX):
                     {slope:NEG0002000, b:T2PT8215}}},
          TYPE1:
             {CTL2ONE:
                 {(LITTLEX, BIGX):
                     {slope:NOSLOPE, b:T2PT3500}},
              CTL2TWO:
                 {(LITTLEX, BIGX):
                     {slope:NOSLOPE, b:T2PT4500}}},
          UNDEF:
             {CTL2ONE:
                 {(LITTLEX, BIGX):
                     {slope:NOSLOPE, b:DEFAULTY}},
              CTL2TWO:
                 {(LITTLEX, BIGX):
                     {slope:NOSLOPE, b:DEFAULTY}}}}
# END LOC1

# LOC2
LOC2YS = {TYPE9:
             {CTL2ONE:
                 {(LITTLEX, BIGX):
                     {slope:NEG0006000, b:T3PT3121}},
              CTL2TWO:
                 {(LITTLEX, BIGX):
                     {slope:NEG0006000, b:T3PT3121}}},
          TYPE8:
              {CTL2ONE:
                  {(LITTLEX, BIGX):
                      {slope:NOSLOPE, b:T2PT6500}},
               CTL2TWO:
                  {(LITTLEX, BIGX):
                      {slope:NOSLOPE, b:T2PT6500}}},
          TYPE7:{CTL2ONE:
                  {(LITTLEX, BIGX):
                      {slope:NEG0012000, b:T3PT8860}},
               CTL2TWO:
                  {(LITTLEX, BIGX):
                      {slope:NEG0000700, b:T2PT6787}}},
          TYPE6:
             {CTL2ONE:
                 {(LITTLEX, BIGX):
                     {slope:NEG0010000, b:T3PT7310}},
              CTL2TWO:
                 {(LITTLEX, BIGX):
                     {slope:NEG0012000, b:F4PT0757}}},
          TYPE5:
             {CTL2ONE:
                 {(LITTLEX, BIGX):
                     {slope:NOSLOPE, b:T2PT1000}},
              CTL2TWO:
                 {(LITTLEX, BIGX):
                     {slope:NEG0003000, b:T2PT9564}}},
          TYPE4:
             {CTL2ONE:
                 {(LITTLEX, BIGX):
                     {slope:NEG0000090, b:T2PT1972}},
              CTL2TWO:
                 {(LITTLEX, BIGX):
                     {slope:NEG0005000, b:T3PT2461}}},
          TYPE1:
             {CTL2ONE:
                 {(LITTLEX, BIGX):
                     {slope:NEG0010000, b:T3PT7257}},
              CTL2TWO:
                 {(LITTLEX, BIGX):
                     {slope:NEG0010000, b:T3PT7257}}},
          UNDEF:
             {CTL2ONE:
                 {(LITTLEX, BIGX):
                     {slope:NOSLOPE, b:DEFAULTY}},
              CTL2TWO:
                 {(LITTLEX, BIGX):
                     {slope:NOSLOPE, b:DEFAULTY}}}}
# END LOC2

LOC3YS = {TYPE9:
             {CTL2ONE:
                 {(LITTLEX, BIGX):
                     {slope:NOSLOPE, b:T2PT4900}},
              CTL2TWO:
                 {(LITTLEX, BIGX):
                     {slope:NOSLOPE, b:T2PT4900}}},
          TYPE8:{CTL2ONE:
                  {(LITTLEX, 1.00000):
                      {slope:NOSLOPE, b:T2PT6400},
                   (1.00001, BIGX):
                      {slope:NEG0006000, b:T3PT3291}},
               CTL2TWO:
                  {(LITTLEX, 1.00000):
                      {slope:NOSLOPE, b:T2PT6400},
                   (1.00001, BIGX):
                      {slope:NEG0006000, b:T3PT3291}}},
          TYPE7:{CTL2ONE:
                  {(LITTLEX, 1.05000):
                      {slope:NOSLOPE, b:T2PT6700},
                   (1.05001, BIGX):
                      {slope:NEG0009000, b:T3PT5929}},
               CTL2TWO:
                  {(LITTLEX, 1.05000):
                      {slope:NOSLOPE, b:T2PT6700},
                   (1.05001, BIGX):
                      {slope:NEG0009000, b:T3PT5929}}},
          TYPE6:
             {CTL2ONE:
                 {(LITTLEX, 1.08000):
                     {slope:NOSLOPE, b:T2PT6500},
                  (1.08001, BIGX):
                     {slope:NEG0013000, b:F4PT0665}},
              CTL2TWO:
                 {(LITTLEX, 1.08000):
                     {slope:NOSLOPE, b:T2PT6500},
                  (1.08001, BIGX):
                     {slope:NEG0013000, b:F4PT0665}}},
          TYPE5:
             {CTL2ONE:
                 {(LITTLEX, 950.0):
                     {slope:NOSLOPE, b:T2PT5900},
                  (950.01, BIGX):
                     {slope:NEG0010000, b:T3PT4996}},
              CTL2TWO:
                 {(LITTLEX, 950.0):
                     {slope:NOSLOPE, b:T2PT5900},
                  (950.01, BIGX):
                     {slope:NEG0010000, b:T3PT4996}}},
          TYPE4:
             {CTL2ONE:
                 {(LITTLEX, 1.10000):
                     {slope:NOSLOPE, b:T2PT6800},
                  (1.10001, BIGX):
                     {slope:NEG0018000, b:F4PT6690}},
              CTL2TWO:
                 {(LITTLEX, 1.10000):
                     {slope:NOSLOPE, b:T2PT6800},
                  (1.10001, BIGX):
                     {slope:NEG0018000, b:F4PT6690}}},
          TYPE1:
             {CTL2ONE:
                 {(LITTLEX, 1.00000):
                     {slope:NOSLOPE, b:T2PT4900},
                  (1.00001, BIGX):
                     {slope:NEG0004000, b:T2PT8857}},
              CTL2TWO:
                 {(LITTLEX, 1.00000):
                     {slope:NOSLOPE, b:T2PT4900},
                  (1.00001, BIGX):
                     {slope:NEG0004000, b:T2PT8857}}},
          UNDEF:
             {CTL2ONE:
                 {(LITTLEX, BIGX):
                     {slope:NOSLOPE, b:DEFAULTY}},
              CTL2TWO:
                 {(LITTLEX, BIGX):
                     {slope:NOSLOPE, b:DEFAULTY}}}}
# END LOC3

# LOC4
LOC4YS = {TYPE9:
             {CTL2ONE:
                 {(LITTLEX, BIGX):
                     {slope:NEG0000008, b:T2PT6761}},
              CTL2TWO:
                 {(LITTLEX, BIGX):
                     {slope:NEG0000008, b:T2PT6761}}},
          TYPE8:{CTL2ONE:
                  {(LITTLEX, BIGX):
                      {slope:NEG0000030, b:T2PT6975}},
               CTL2TWO:
                  {(LITTLEX, BIGX):
                      {slope:NEG0000030, b:T2PT6975}}},
          TYPE7:
             {CTL2ONE:
                 {(LITTLEX, 1.00000):
                     {slope:NOSLOPE, b:T2PT6000},
                  (1.00001, BIGX):
                     {slope:NEG0018000, b:F4PT3902}},
              CTL2TWO:
                 {(LITTLEX, BIGX):
                     {slope:NEG0000900, b:T2PT7334}}},
          TYPE6:
             {CTL2ONE:
                 {(LITTLEX, 1.10000):
                     {slope:NOSLOPE, b:T2PT5800},
                  (1.10001, BIGX):
                     {slope:NEG0013000, b:F4PT0322}},
              CTL2TWO:
                 {(LITTLEX, BIGX):
                     {slope:NEG0002000, b:T2PT8081}}},
          TYPE5:
             {CTL2ONE:
                 {(LITTLEX, BIGX):
                     {slope:NEG0018000, b:F4PT2758}},
              CTL2TWO:
                 {(LITTLEX, BIGX):
                     {slope:NEG0001000, b:T2PT6535}}},
          TYPE4:
             {CTL2ONE:
                 {(LITTLEX, 1.00000):
                     {slope:NOSLOPE, b:T2PT6000},
                  (1.00001, BIGX):
                     {slope:NEG0020000, b:F4PT5548}},
              CTL2TWO:
                 {(LITTLEX, 1125.0):
                     {slope:NOSLOPE, b:T2PT6500},
                  (1125.01, BIGX):
                     {slope:NEG0011000, b:T3PT9184}}},
          TYPE1:
             {CTL2ONE:
                 {(LITTLEX, BIGX):
                     {slope:NEG0003000, b:T2PT7802}},
              CTL2TWO:
                 {(LITTLEX, BIGX):
                     {slope:NEG0003000, b:T2PT7802}}},
          UNDEF:
             {CTL2ONE:
                 {(LITTLEX, BIGX):
                     {slope:NOSLOPE, b:DEFAULTY}},
              CTL2TWO:
                 {(LITTLEX, BIGX):
                     {slope:NOSLOPE, b:DEFAULTY}}}}
# END LOC4

YS = {LOC1:LOC1YS,
      LOC2:LOC2YS,
      LOC3:LOC3YS,
      LOC4:LOC4YS}

VALIDCONTROL1 = [TYPE9, TYPE8, TYPE7, TYPE6, TYPE5, TYPE4, TYPE1]

RETURNDEFAULTMSG = 'Returning default Y for  {0:s}, {1:2.0f}, {2:2.0f}, {3:8.5f} . .  .'
TESTINGMSG = 'Testing dictionary based y == function based y for {0:s}, {1:d}, {2:d}, {3:8.5f} . .  .'
ASSERTIONERRORMSG = 'Assertion Error for {0:s}, {1:f}, {2:d}, {3:8.5f} . . .'

def gety(loc, control1, x, control2):
    """
    y calculation for y = mx + b.

    loc is the four letter loc abbreviation (loc1).

    control1 is the integer CONTROL1 code.

    x is a float for the x component of y = mx + b.

    control2 is the integer CONTROL2 code.
    """
    # Compute y using formula (y = mx + b), control1, x, control2.
    # Match loc.
    ydictionary = YS[loc]
    # Check if control1 code belongs to recognized types.
    if control1 in VALIDCONTROL1:
        # Match control1.
        for control2x in ydictionary[control1]:
            # Match control2; skip keys that are not the requested code.
            # (Indexing with control2x avoids a KeyError when control2
            # is not defined for this control1 code.)
            if control2x != control2:
                continue
            for xrangex in ydictionary[control1][control2x]:
                # Match x range.
                if xrangex[0] <= x <= xrangex[1]:
                    mxb = ydictionary[control1][control2x][xrangex]
                    y = mxb[slope] * x + mxb[b]
                    return y
        # Possible that control2 is not defined, or that x falls in
        # no range for it; default to CTL2TWO.
        for xrangex in ydictionary[control1][CTL2TWO]:
            # Match x range.
            if xrangex[0] <= x <= xrangex[1]:
                mxb = ydictionary[control1][CTL2TWO][xrangex]
                y = mxb[slope] * x + mxb[b]
                return y
    # Doesn't matter if CTL2TWO or CTL2ONE or undefined
    #     - default for loc will always be [UNDEF][CTL2TWO].
    print RETURNDEFAULTMSG.format(loc, control1, control2, x)
    return ydictionary[UNDEF][CTL2TWO][(LITTLEX, BIGX)][b]

# TEST Calculations.
TESTFUNCS = {LOC1:vfx.loc1fromvendor,
             LOC2:vfx.loc2fromvendor,
             LOC3:vfx.loc3fromvendor,
             LOC4:vfx.loc4fromvendor}

for locx in YS:
    for control1 in YS[locx]:
        for control2 in YS[locx][control1]:
            for xrangex in YS[locx][control1][control2]:
                for z in xrangex:
                    dictionarybasedy = gety(locx, control1, z, control2)
                    functionbasedy = TESTFUNCS[locx](control1, control2, z)
                    print TESTINGMSG.format(locx, control1, control2, z)
                    print 'dictionarybasedy = {0:8.7f}'.format(dictionarybasedy)
                    print 'functionbasedy = {0:8.7f}'.format(functionbasedy)
                    try:
                        assert dictionarybasedy == functionbasedy
                    except AssertionError:
                        print ASSERTIONERRORMSG.format(locx, control1, control2, z)
                        sys.exit()

And the output:


Testing dictionary based y == function based y for loc2, 1, 1,  0.00000 . .  .
dictionarybasedy = 3.7257000
functionbasedy = 3.7257000
Testing dictionary based y == function based y for loc2, 1, 1, 5000.00000 . .  .
dictionarybasedy = -1.2743000
functionbasedy = -1.2743000
Testing dictionary based y == function based y for loc2, 1, 2,  0.00000 . .  .
dictionarybasedy = 3.7257000
functionbasedy = 3.7257000
Testing dictionary based y == function based y for loc2, 1, 2, 5000.00000 . .  .
dictionarybasedy = -1.2743000
functionbasedy = -1.2743000
Returning default Y for  loc2, 99,  1,  0.00000 . .  .
Testing dictionary based y == function based y for loc2, 99, 1,  0.00000 . .  .
dictionarybasedy = 2.5000000
functionbasedy = 2.5000000
Returning default Y for  loc2, 99,  1, 5000.00000 . .  .
Testing dictionary based y == function based y for loc2, 99, 1, 5000.00000 . .  .
dictionarybasedy = 2.5000000
functionbasedy = 2.5000000
Returning default Y for  loc2, 99,  2,  0.00000 . .  .
Testing dictionary based y == function based y for loc2, 99, 2,  0.00000 . .  .
dictionarybasedy = 2.5000000
functionbasedy = 2.5000000
Returning default Y for  loc2, 99,  2, 5000.00000 . .  .
Testing dictionary based y == function based y for loc2, 99, 2, 5000.00000 . .  .
dictionarybasedy = 2.5000000
functionbasedy = 2.5000000
Testing dictionary based y == function based y for loc2, 4, 1,  0.00000 . .  .
dictionarybasedy = 2.1972000
functionbasedy = 2.1972000
Testing dictionary based y == function based y for loc2, 4, 1, 5000.00000 . .  .
dictionarybasedy = 2.1522000
functionbasedy = 2.1522000
Testing dictionary based y == function based y for loc2, 4, 2,  0.00000 . .  .
dictionarybasedy = 3.2461000
functionbasedy = 3.2461000
Testing dictionary based y == function based y for loc2, 4, 2, 5000.00000 . .  .
dictionarybasedy = 0.7461000
functionbasedy = 0.7461000
Testing dictionary based y == function based y for loc2, 5, 1,  0.00000 . .  .
dictionarybasedy = 2.1000000
functionbasedy = 2.1000000
Testing dictionary based y == function based y for loc2, 5, 1, 5000.00000 . .  .
dictionarybasedy = 2.1000000
functionbasedy = 2.1000000
Testing dictionary based y == function based y for loc2, 5, 2,  0.00000 . .  .
dictionarybasedy = 2.9564000
functionbasedy = 2.9564000
Testing dictionary based y == function based y for loc2, 5, 2, 5000.00000 . .  .
dictionarybasedy = 1.4564000
functionbasedy = 1.4564000
Testing dictionary based y == function based y for loc2, 6, 1,  0.00000 . .  .
dictionarybasedy = 3.7310000
functionbasedy = 3.7310000
Testing dictionary based y == function based y for loc2, 6, 1, 5000.00000 . .  .
dictionarybasedy = -1.2690000
functionbasedy = -1.2690000
Testing dictionary based y == function based y for loc2, 6, 2,  0.00000 . .  .
dictionarybasedy = 4.0757000
functionbasedy = 4.0757000
Testing dictionary based y == function based y for loc2, 6, 2, 5000.00000 . .  .
dictionarybasedy = -1.9243000
functionbasedy = -1.9243000
Testing dictionary based y == function based y for loc2, 7, 1,  0.00000 . .  .
dictionarybasedy = 3.8860000
functionbasedy = 3.8860000
Testing dictionary based y == function based y for loc2, 7, 1, 5000.00000 . .  .
dictionarybasedy = -2.1140000
functionbasedy = -2.1140000
Testing dictionary based y == function based y for loc2, 7, 2,  0.00000 . .  .
dictionarybasedy = 2.6787000
functionbasedy = 2.6787000
Testing dictionary based y == function based y for loc2, 7, 2, 5000.00000 . .  .
dictionarybasedy = 2.3287000
functionbasedy = 2.3287000
Testing dictionary based y == function based y for loc2, 8, 1,  0.00000 . .  .
dictionarybasedy = 2.6500000
functionbasedy = 2.6500000
Testing dictionary based y == function based y for loc2, 8, 1, 5000.00000 . .  .
dictionarybasedy = 2.6500000
functionbasedy = 2.6500000
Testing dictionary based y == function based y for loc2, 8, 2,  0.00000 . .  .
dictionarybasedy = 2.6500000
functionbasedy = 2.6500000
Testing dictionary based y == function based y for loc2, 8, 2, 5000.00000 . .  .
dictionarybasedy = 2.6500000
functionbasedy = 2.6500000
Testing dictionary based y == function based y for loc2, 9, 1,  0.00000 . .  .
dictionarybasedy = 3.3121000
functionbasedy = 3.3121000
Testing dictionary based y == function based y for loc2, 9, 1, 5000.00000 . .  .
dictionarybasedy = 0.3121000
functionbasedy = 0.3121000
Testing dictionary based y == function based y for loc2, 9, 2,  0.00000 . .  .
dictionarybasedy = 3.3121000
functionbasedy = 3.3121000
Testing dictionary based y == function based y for loc2, 9, 2, 5000.00000 . .  .
dictionarybasedy = 0.3121000
functionbasedy = 0.3121000
Testing dictionary based y == function based y for loc3, 1, 1,  0.00000 . .  .
dictionarybasedy = 2.4900000
functionbasedy = 2.4900000
Testing dictionary based y == function based y for loc3, 1, 1,  1.00000 . .  .
dictionarybasedy = 2.4900000
functionbasedy = 2.4900000
Testing dictionary based y == function based y for loc3, 1, 1,  1.00001 . .  .
dictionarybasedy = 2.8853000
functionbasedy = 2.8853000
Testing dictionary based y == function based y for loc3, 1, 1, 5000.00000 . .  .
dictionarybasedy = 0.8857000
functionbasedy = 0.8857000
Testing dictionary based y == function based y for loc3, 1, 2,  0.00000 . .  .
dictionarybasedy = 2.4900000
functionbasedy = 2.4900000
Testing dictionary based y == function based y for loc3, 1, 2,  1.00000 . .  .
dictionarybasedy = 2.4900000
functionbasedy = 2.4900000
Testing dictionary based y == function based y for loc3, 1, 2,  1.00001 . .  .
dictionarybasedy = 2.8853000
functionbasedy = 2.8853000
Testing dictionary based y == function based y for loc3, 1, 2, 5000.00000 . .  .
dictionarybasedy = 0.8857000
functionbasedy = 0.8857000
Returning default Y for  loc3, 99,  1,  0.00000 . .  .
Testing dictionary based y == function based y for loc3, 99, 1,  0.00000 . .  .
dictionarybasedy = 2.5000000
functionbasedy = 2.5000000
Returning default Y for  loc3, 99,  1, 5000.00000 . .  .
Testing dictionary based y == function based y for loc3, 99, 1, 5000.00000 . .  .
dictionarybasedy = 2.5000000
functionbasedy = 2.5000000
Returning default Y for  loc3, 99,  2,  0.00000 . .  .
Testing dictionary based y == function based y for loc3, 99, 2,  0.00000 . .  .
dictionarybasedy = 2.5000000
functionbasedy = 2.5000000
Returning default Y for  loc3, 99,  2, 5000.00000 . .  .
Testing dictionary based y == function based y for loc3, 99, 2, 5000.00000 . .  .
dictionarybasedy = 2.5000000
functionbasedy = 2.5000000
Testing dictionary based y == function based y for loc3, 4, 1,  0.00000 . .  .
dictionarybasedy = 2.6800000
functionbasedy = 2.6800000
Testing dictionary based y == function based y for loc3, 4, 1,  1.10000 . .  .
dictionarybasedy = 2.6800000
functionbasedy = 2.6800000
Testing dictionary based y == function based y for loc3, 4, 1,  1.10001 . .  .
dictionarybasedy = 4.6670200
functionbasedy = 4.6670200
Testing dictionary based y == function based y for loc3, 4, 1, 5000.00000 . .  .
dictionarybasedy = -4.3310000
functionbasedy = -4.3310000
Testing dictionary based y == function based y for loc3, 4, 2,  0.00000 . .  .
dictionarybasedy = 2.6800000
functionbasedy = 2.6800000
Testing dictionary based y == function based y for loc3, 4, 2,  1.10000 . .  .
dictionarybasedy = 2.6800000
functionbasedy = 2.6800000
Testing dictionary based y == function based y for loc3, 4, 2,  1.10001 . .  .
dictionarybasedy = 4.6670200
functionbasedy = 4.6670200
Testing dictionary based y == function based y for loc3, 4, 2, 5000.00000 . .  .
dictionarybasedy = -4.3310000
functionbasedy = -4.3310000
Testing dictionary based y == function based y for loc3, 5, 1, 950.01000 . .  .
dictionarybasedy = 2.5495900
functionbasedy = 2.5495900
Testing dictionary based y == function based y for loc3, 5, 1, 5000.00000 . .  .
dictionarybasedy = -1.5004000
functionbasedy = -1.5004000
Testing dictionary based y == function based y for loc3, 5, 1,  0.00000 . .  .
dictionarybasedy = 2.5900000
functionbasedy = 2.5900000
Testing dictionary based y == function based y for loc3, 5, 1, 950.00000 . .  .
dictionarybasedy = 2.5900000
functionbasedy = 2.5900000
Testing dictionary based y == function based y for loc3, 5, 2, 950.01000 . .  .
dictionarybasedy = 2.5495900
functionbasedy = 2.5495900
Testing dictionary based y == function based y for loc3, 5, 2, 5000.00000 . .  .
dictionarybasedy = -1.5004000
functionbasedy = -1.5004000
Testing dictionary based y == function based y for loc3, 5, 2,  0.00000 . .  .
dictionarybasedy = 2.5900000
functionbasedy = 2.5900000
Testing dictionary based y == function based y for loc3, 5, 2, 950.00000 . .  .
dictionarybasedy = 2.5900000
functionbasedy = 2.5900000
Testing dictionary based y == function based y for loc3, 6, 1,  0.00000 . .  .
dictionarybasedy = 2.6500000
functionbasedy = 2.6500000
Testing dictionary based y == function based y for loc3, 6, 1,  1.08000 . .  .
dictionarybasedy = 2.6500000
functionbasedy = 2.6500000
Testing dictionary based y == function based y for loc3, 6, 1,  1.08001 . .  .
dictionarybasedy = 4.0650960
functionbasedy = 4.0650960
Testing dictionary based y == function based y for loc3, 6, 1, 5000.00000 . .  .
dictionarybasedy = -2.4335000
functionbasedy = -2.4335000
Testing dictionary based y == function based y for loc3, 6, 2,  0.00000 . .  .
dictionarybasedy = 2.6500000
functionbasedy = 2.6500000
Testing dictionary based y == function based y for loc3, 6, 2,  1.08000 . .  .
dictionarybasedy = 2.6500000
functionbasedy = 2.6500000
Testing dictionary based y == function based y for loc3, 6, 2,  1.08001 . .  .
dictionarybasedy = 4.0650960
functionbasedy = 4.0650960
Testing dictionary based y == function based y for loc3, 6, 2, 5000.00000 . .  .
dictionarybasedy = -2.4335000
functionbasedy = -2.4335000
Testing dictionary based y == function based y for loc3, 7, 1,  0.00000 . .  .
dictionarybasedy = 2.6700000
functionbasedy = 2.6700000
Testing dictionary based y == function based y for loc3, 7, 1,  1.05000 . .  .
dictionarybasedy = 2.6700000
functionbasedy = 2.6700000
Testing dictionary based y == function based y for loc3, 7, 1,  1.05001 . .  .
dictionarybasedy = 3.5919550
functionbasedy = 3.5919550
Testing dictionary based y == function based y for loc3, 7, 1, 5000.00000 . .  .
dictionarybasedy = -0.9071000
functionbasedy = -0.9071000
Testing dictionary based y == function based y for loc3, 7, 2,  0.00000 . .  .
dictionarybasedy = 2.6700000
functionbasedy = 2.6700000
Testing dictionary based y == function based y for loc3, 7, 2,  1.05000 . .  .
dictionarybasedy = 2.6700000
functionbasedy = 2.6700000
Testing dictionary based y == function based y for loc3, 7, 2,  1.05000 . .  .
dictionarybasedy = 3.5919550
functionbasedy = 3.5919550
Testing dictionary based y == function based y for loc3, 7, 2, 5000.00000 . .  .
dictionarybasedy = -0.9071000
functionbasedy = -0.9071000
Testing dictionary based y == function based y for loc3, 8, 1,  0.00000 . .  .
dictionarybasedy = 2.6400000
functionbasedy = 2.6400000
Testing dictionary based y == function based y for loc3, 8, 1,  1.00000 . .  .
dictionarybasedy = 2.6400000
functionbasedy = 2.6400000
Testing dictionary based y == function based y for loc3, 8, 1,  1.00001 . .  .
dictionarybasedy = 3.3285000
functionbasedy = 3.3285000
Testing dictionary based y == function based y for loc3, 8, 1, 5000.00000 . .  .
dictionarybasedy = 0.3291000
functionbasedy = 0.3291000
Testing dictionary based y == function based y for loc3, 8, 2,  0.00000 . .  .
dictionarybasedy = 2.6400000
functionbasedy = 2.6400000
Testing dictionary based y == function based y for loc3, 8, 2,  1.00000 . .  .
dictionarybasedy = 2.6400000
functionbasedy = 2.6400000
Testing dictionary based y == function based y for loc3, 8, 2,  1.00001 . .  .
dictionarybasedy = 3.3285000
functionbasedy = 3.3285000
Testing dictionary based y == function based y for loc3, 8, 2, 5000.00000 . .  .
dictionarybasedy = 0.3291000
functionbasedy = 0.3291000
Testing dictionary based y == function based y for loc3, 9, 1,  0.00000 . .  .
dictionarybasedy = 2.4900000
functionbasedy = 2.4900000
Testing dictionary based y == function based y for loc3, 9, 1, 5000.00000 . .  .
dictionarybasedy = 2.4900000
functionbasedy = 2.4900000
Testing dictionary based y == function based y for loc3, 9, 2,  0.00000 . .  .
dictionarybasedy = 2.4900000
functionbasedy = 2.4900000
Testing dictionary based y == function based y for loc3, 9, 2, 5000.00000 . .  .
dictionarybasedy = 2.4900000
functionbasedy = 2.4900000
Testing dictionary based y == function based y for loc1, 1, 1,  0.00000 . .  .
dictionarybasedy = 2.3500000
functionbasedy = 2.3500000
Testing dictionary based y == function based y for loc1, 1, 1, 5000.00000 . .  .
dictionarybasedy = 2.3500000
functionbasedy = 2.3500000
Testing dictionary based y == function based y for loc1, 1, 2,  0.00000 . .  .
dictionarybasedy = 2.4500000
functionbasedy = 2.4500000
Testing dictionary based y == function based y for loc1, 1, 2, 5000.00000 . .  .
dictionarybasedy = 2.4500000
functionbasedy = 2.4500000
Returning default Y for  loc1, 99,  1,  0.00000 . .  .
Testing dictionary based y == function based y for loc1, 99, 1,  0.00000 . .  .
dictionarybasedy = 2.5000000
functionbasedy = 2.5000000
Returning default Y for  loc1, 99,  1, 5000.00000 . .  .
Testing dictionary based y == function based y for loc1, 99, 1, 5000.00000 . .  .
dictionarybasedy = 2.5000000
functionbasedy = 2.5000000
Returning default Y for  loc1, 99,  2,  0.00000 . .  .
Testing dictionary based y == function based y for loc1, 99, 2,  0.00000 . .  .
dictionarybasedy = 2.5000000
functionbasedy = 2.5000000
Returning default Y for  loc1, 99,  2, 5000.00000 . .  .
Testing dictionary based y == function based y for loc1, 99, 2, 5000.00000 . .  .
dictionarybasedy = 2.5000000
functionbasedy = 2.5000000
Testing dictionary based y == function based y for loc1, 4, 1,  1.29001 . .  .
dictionarybasedy = 6.7215720
functionbasedy = 6.7215720
Testing dictionary based y == function based y for loc1, 4, 1, 5000.00000 . .  .
dictionarybasedy = -9.2743000
functionbasedy = -9.2743000
Testing dictionary based y == function based y for loc1, 4, 1,  0.00000 . .  .
dictionarybasedy = 2.6000000
functionbasedy = 2.6000000
Testing dictionary based y == function based y for loc1, 4, 1,  1.29000 . .  .
dictionarybasedy = 2.6000000
functionbasedy = 2.6000000
Testing dictionary based y == function based y for loc1, 4, 2,  0.00000 . .  .
dictionarybasedy = 2.8215000
functionbasedy = 2.8215000
Testing dictionary based y == function based y for loc1, 4, 2, 5000.00000 . .  .
dictionarybasedy = 1.8215000
functionbasedy = 1.8215000
Testing dictionary based y == function based y for loc1, 5, 1,  1.25001 . .  .
dictionarybasedy = 5.7119500
functionbasedy = 5.7119500
Testing dictionary based y == function based y for loc1, 5, 1, 5000.00000 . .  .
dictionarybasedy = -7.2848000
functionbasedy = -7.2848000
Testing dictionary based y == function based y for loc1, 5, 1,  0.00000 . .  .
dictionarybasedy = 2.4700000
functionbasedy = 2.4700000
Testing dictionary based y == function based y for loc1, 5, 1,  1.25000 . .  .
dictionarybasedy = 2.4700000
functionbasedy = 2.4700000
Testing dictionary based y == function based y for loc1, 5, 2,  0.00000 . .  .
dictionarybasedy = 2.8733000
functionbasedy = 2.8733000
Testing dictionary based y == function based y for loc1, 5, 2, 5000.00000 . .  .
dictionarybasedy = 1.3733000
functionbasedy = 1.3733000
Testing dictionary based y == function based y for loc1, 6, 1,  0.00000 . .  .
dictionarybasedy = 2.5700000
functionbasedy = 2.5700000
Testing dictionary based y == function based y for loc1, 6, 1,  1.31000 . .  .
dictionarybasedy = 2.5700000
functionbasedy = 2.5700000
Testing dictionary based y == function based y for loc1, 6, 1,  1.31001 . .  .
dictionarybasedy = 4.9283420
functionbasedy = 4.9283420
Testing dictionary based y == function based y for loc1, 6, 1, 5000.00000 . .  .
dictionarybasedy = -4.0693000
functionbasedy = -4.0693000
Testing dictionary based y == function based y for loc1, 6, 2,  0.00000 . .  .
dictionarybasedy = 3.0612000
functionbasedy = 3.0612000
Testing dictionary based y == function based y for loc1, 6, 2, 5000.00000 . .  .
dictionarybasedy = 1.0612000
functionbasedy = 1.0612000
Testing dictionary based y == function based y for loc1, 7, 1,  1.31501 . .  .
dictionarybasedy = 6.5440550
functionbasedy = 6.5440550
Testing dictionary based y == function based y for loc1, 7, 1, 5000.00000 . .  .
dictionarybasedy = -8.4520000
functionbasedy = -8.4520000
Testing dictionary based y == function based y for loc1, 7, 1,  0.00000 . .  .
dictionarybasedy = 2.6000000
functionbasedy = 2.6000000
Testing dictionary based y == function based y for loc1, 7, 1,  1.31500 . .  .
dictionarybasedy = 2.6000000
functionbasedy = 2.6000000
Testing dictionary based y == function based y for loc1, 7, 2,  0.00000 . .  .
dictionarybasedy = 2.9580000
functionbasedy = 2.9580000
Testing dictionary based y == function based y for loc1, 7, 2, 5000.00000 . .  .
dictionarybasedy = -12.5420000
functionbasedy = -12.5420000
Testing dictionary based y == function based y for loc1, 8, 1,  0.00000 . .  .
dictionarybasedy = 2.6000000
functionbasedy = 2.6000000
Testing dictionary based y == function based y for loc1, 8, 1, 5000.00000 . .  .
dictionarybasedy = 2.6000000
functionbasedy = 2.6000000
Testing dictionary based y == function based y for loc1, 8, 2,  0.00000 . .  .
dictionarybasedy = 2.6000000
functionbasedy = 2.6000000
Testing dictionary based y == function based y for loc1, 8, 2, 5000.00000 . .  .
dictionarybasedy = 2.6000000
functionbasedy = 2.6000000
Testing dictionary based y == function based y for loc1, 9, 1,  0.00000 . .  .
dictionarybasedy = 2.5300000
functionbasedy = 2.5300000
Testing dictionary based y == function based y for loc1, 9, 1,  1.27500 . .  .
dictionarybasedy = 2.5300000
functionbasedy = 2.5300000
Testing dictionary based y == function based y for loc1, 9, 1,  1.27501 . .  .
dictionarybasedy = 6.4777175
functionbasedy = 6.4777175
Testing dictionary based y == function based y for loc1, 9, 1, 5000.00000 . .  .
dictionarybasedy = 4.9781000
functionbasedy = 4.9781000
Testing dictionary based y == function based y for loc1, 9, 2,  0.00000 . .  .
dictionarybasedy = 2.5400000
functionbasedy = 2.5400000
Testing dictionary based y == function based y for loc1, 9, 2, 5000.00000 . .  .
dictionarybasedy = 2.5400000
functionbasedy = 2.5400000
Testing dictionary based y == function based y for loc4, 1, 1,  0.00000 . .  .
dictionarybasedy = 2.7802000
functionbasedy = 2.7802000
Testing dictionary based y == function based y for loc4, 1, 1, 5000.00000 . .  .
dictionarybasedy = 1.2802000
functionbasedy = 1.2802000
Testing dictionary based y == function based y for loc4, 1, 2,  0.00000 . .  .
dictionarybasedy = 2.7802000
functionbasedy = 2.7802000
Testing dictionary based y == function based y for loc4, 1, 2, 5000.00000 . .  .
dictionarybasedy = 1.2802000
functionbasedy = 1.2802000
Returning default Y for  loc4, 99,  1,  0.00000 . .  .
Testing dictionary based y == function based y for loc4, 99, 1,  0.00000 . .  .
dictionarybasedy = 2.5000000
functionbasedy = 2.5000000
Returning default Y for  loc4, 99,  1, 5000.00000 . .  .
Testing dictionary based y == function based y for loc4, 99, 1, 5000.00000 . .  .
dictionarybasedy = 2.5000000
functionbasedy = 2.5000000
Returning default Y for  loc4, 99,  2,  0.00000 . .  .
Testing dictionary based y == function based y for loc4, 99, 2,  0.00000 . .  .
dictionarybasedy = 2.5000000
functionbasedy = 2.5000000
Returning default Y for  loc4, 99,  2, 5000.00000 . .  .
Testing dictionary based y == function based y for loc4, 99, 2, 5000.00000 . .  .
dictionarybasedy = 2.5000000
functionbasedy = 2.5000000
Testing dictionary based y == function based y for loc4, 4, 1,  0.00000 . .  .
dictionarybasedy = 2.6000000
functionbasedy = 2.6000000
Testing dictionary based y == function based y for loc4, 4, 1,  1.00000 . .  .
dictionarybasedy = 2.6000000
functionbasedy = 2.6000000
Testing dictionary based y == function based y for loc4, 4, 1,  1.00001 . .  .
dictionarybasedy = 4.5528000
functionbasedy = 4.5528000
Testing dictionary based y == function based y for loc4, 4, 1, 5000.00000 . .  .
dictionarybasedy = -5.4452000
functionbasedy = -5.4452000
Testing dictionary based y == function based y for loc4, 4, 2,  0.00000 . .  .
dictionarybasedy = 2.6500000
functionbasedy = 2.6500000
Testing dictionary based y == function based y for loc4, 4, 2, 1125.00000 . .  .
dictionarybasedy = 2.6500000
functionbasedy = 2.6500000
Testing dictionary based y == function based y for loc4, 4, 2, 1125.01000 . .  .
dictionarybasedy = 2.6808890
functionbasedy = 2.6808890
Testing dictionary based y == function based y for loc4, 4, 2, 5000.00000 . .  .
dictionarybasedy = -1.5816000
functionbasedy = -1.5816000
Testing dictionary based y == function based y for loc4, 5, 1,  0.00000 . .  .
dictionarybasedy = 4.2758000
functionbasedy = 4.2758000
Testing dictionary based y == function based y for loc4, 5, 1, 5000.00000 . .  .
dictionarybasedy = -4.7242000
functionbasedy = -4.7242000
Testing dictionary based y == function based y for loc4, 5, 2,  0.00000 . .  .
dictionarybasedy = 2.6535000
functionbasedy = 2.6535000
Testing dictionary based y == function based y for loc4, 5, 2, 5000.00000 . .  .
dictionarybasedy = 2.1535000
functionbasedy = 2.1535000
Testing dictionary based y == function based y for loc4, 6, 1,  0.00000 . .  .
dictionarybasedy = 2.5800000
functionbasedy = 2.5800000
Testing dictionary based y == function based y for loc4, 6, 1,  1.10000 . .  .
dictionarybasedy = 2.5800000
functionbasedy = 2.5800000
Testing dictionary based y == function based y for loc4, 6, 1,  1.10001 . .  .
dictionarybasedy = 4.0307700
functionbasedy = 4.0307700
Testing dictionary based y == function based y for loc4, 6, 1, 5000.00000 . .  .
dictionarybasedy = -2.4678000
functionbasedy = -2.4678000
Testing dictionary based y == function based y for loc4, 6, 2,  0.00000 . .  .
dictionarybasedy = 2.8081000
functionbasedy = 2.8081000
Testing dictionary based y == function based y for loc4, 6, 2, 5000.00000 . .  .
dictionarybasedy = 1.8081000
functionbasedy = 1.8081000
Testing dictionary based y == function based y for loc4, 7, 1,  0.00000 . .  .
dictionarybasedy = 2.6000000
functionbasedy = 2.6000000
Testing dictionary based y == function based y for loc4, 7, 1,  1.00000 . .  .
dictionarybasedy = 2.6000000
functionbasedy = 2.6000000
Testing dictionary based y == function based y for loc4, 7, 1,  1.00001 . .  .
dictionarybasedy = 4.3884000
functionbasedy = 4.3884000
Testing dictionary based y == function based y for loc4, 7, 1, 5000.00000 . .  .
dictionarybasedy = -4.6098000
functionbasedy = -4.6098000
Testing dictionary based y == function based y for loc4, 7, 2,  0.00000 . .  .
dictionarybasedy = 2.7334000
functionbasedy = 2.7334000
Testing dictionary based y == function based y for loc4, 7, 2, 5000.00000 . .  .
dictionarybasedy = 2.2834000
functionbasedy = 2.2834000



And that's it.  The test loop bails at the first AssertionError; I like to fix problems one at a time.  It took me about six runs to get everything matched.
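
For anyone who wants to try the pattern without the full tables, here is a condensed sketch of the same idea in Python 3. The two-segment table, the default value, and the `vendor_y` function are hypothetical stand-ins, not the values or vendor functions used above. The comparison uses `math.isclose` with a tolerance, which is a substitute for the exact `==` in the script above, not a description of it.

```python
import math

# Hypothetical table: (low x, high x) -> slope and intercept for y = mx + b.
SEGMENTS = {(0.0, 1.0): {'slope': 0.0, 'b': 2.6},
            (1.00001, 5000.0): {'slope': -0.002, 'b': 4.5548}}
DEFAULT_Y = 2.5


def lookup_y(x, segments=SEGMENTS, default=DEFAULT_Y):
    """Return m * x + b for the segment containing x, else the default."""
    for (low, high), mxb in segments.items():
        if low <= x <= high:
            return mxb['slope'] * x + mxb['b']
    return default


def vendor_y(x):
    """Hypothetical stand-in for a vendor's if/else implementation."""
    if x <= 1.0:
        return 2.6
    return -0.002 * x + 4.5548


def check_endpoints(segments=SEGMENTS):
    """Compare both implementations at every segment endpoint."""
    mismatches = []
    for xrange_ in segments:
        for x in xrange_:
            if not math.isclose(lookup_y(x, segments), vendor_y(x),
                                abs_tol=1e-9):
                mismatches.append(x)
    return mismatches


# An empty list means the two implementations agree at every endpoint.
print(check_endpoints())
```

One thing the sketch makes visible: an x strictly between 1.0 and 1.00001 matches neither segment and falls through to the default, so endpoint-only tests will not catch a disagreement inside that gap.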





Thank you for stopping by.


sqlcmd faux csv dump and parsing with the csv module
2016-08-11T12:45:00.001-07:00
Lately I had another Excel-VBA-Python one-off hack project.  Once again there was the dilemma of not being able to use MSSQL's bcp because my query string was too long.  sqlcmd can run a query from a big SQL file but, to the best of my knowledge, it does not do csv dumps.



This is a hack.  I would normally go to hell for it, but I've done so many other bad hacks I'd have to declare bankruptcy on my programming soul and start over.  Onward.



mssql query file:



<SQL code>

< . . . variable declarations, temp table declarations, etc. . . . >



DECLARE @COMMA CHAR(1) = ',';
DECLARE @LOSSLESS INT = 3;
DECLARE @DOUBLEQUOTE CHAR(1) = CHAR(34);

-- Concatenate strings.
-- Need quoted strings for stockpiles with spaces.
SELECT @DOUBLEQUOTE + StockpileShortName +
       @DOUBLEQUOTE + @COMMA +
       @DOUBLEQUOTE + StockpileID +
       @DOUBLEQUOTE + @COMMA +
       @DOUBLEQUOTE + StkLoc +
       @DOUBLEQUOTE + @COMMA +
       -- Go for full float precision.
       CONVERT(VARCHAR(35),
               tonnes,
               @LOSSLESS) + @COMMA +
       CONVERT(VARCHAR(35),
               grade01,
               @LOSSLESS) + @COMMA +
       CONVERT(VARCHAR(35),
               grade02,
               @LOSSLESS) + @COMMA +
       CONVERT(VARCHAR(35),
               grade03,
               @LOSSLESS) + @COMMA +
       CONVERT(VARCHAR(35),
               grade04,
               @LOSSLESS) + @COMMA +
       CONVERT(VARCHAR(35),
               grade05,
               @LOSSLESS) + @COMMA +
       CONVERT(VARCHAR(35),
               grade06,
               @LOSSLESS)
FROM ##inputresultspvctrachte


< . . . ORDER BY clause . . .>

<End SQL code>



It's pretty obvious what I'm doing (and I'd be shocked if I'm the first to do it):  concatenate all the fields of each result record into a single string column, with the commas built into the string.



A couple notes:



1) all my string identifiers are in double quotes; all my float values are in unquoted text - this will help simplify the Python csv module code below.



2) the @LOSSLESS "constant" - Microsoft's SQL documentation doesn't list an enumeration for this per se.  It's just a straight up whole number 3.  I'm a bit obsessive about constants - wrap that baby in a variable declaration!  Lossless double precision means, if I recall correctly, SQL Server will give you seventeen digits of precision.  This works for what I'm doing (mining stockpile management).
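
As a side illustration of why seventeen digits matter, here is a small Python 3 sketch. It assumes the tonnes and grade columns are 8-byte floats (the same double precision format Python uses for `float`); it demonstrates the property on the Python side only and says nothing about how SQL Server formats the text. The helper name `roundtrips` is made up for this example.

```python
def roundtrips(value, digits):
    """True if text with `digits` significant digits recovers value exactly."""
    text = format(value, '.{0}g'.format(digits))
    return float(text) == value


# The classic example: 0.1 + 0.2 is not exactly 0.3 in double precision.
value = 0.1 + 0.2

print(format(value, '.15g'), roundtrips(value, 15))
print(format(value, '.17g'), roundtrips(value, 17))
```

Fifteen digits print a tidy 0.3 but lose the original value; seventeen digits are enough to get any double back exactly, which is the point of asking for the lossless style.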



The (rough) mssql command to run the query from a DOS prompt:



sqlcmd -S MYSERVERNAME -U MYUSERNAME -P MYPASSWORD -i myqueryfile.sql -o theoutputfile.csv -b



The -b switch provides a Windows error code.  It's a crude check for whether the query parsed OK and ran, but it's better than nothing.
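If the command is launched from Python rather than typed at the prompt, the same crude check amounts to looking at the process return code.  The sketch below is an assumption about how one might wire that up, not something from my actual scripts: run_checked is a made-up helper, and the two demo commands just run the Python interpreter as a stand-in for sqlcmd so the example works on a machine without SQL Server.

```python
import subprocess
import sys


def run_checked(command):
    """
    Run command (a list of strings) and report whether
    it exited with return code 0.  Hypothetical helper.
    """
    completed = subprocess.run(command)
    return completed.returncode == 0


# Stand-ins for a sqlcmd call that succeeds and one that fails.
# A real call would pass the sqlcmd argument list instead.
ok_command = [sys.executable, '-c', 'import sys; sys.exit(0)']
bad_command = [sys.executable, '-c', 'import sys; sys.exit(1)']

print(run_checked(ok_command))
print(run_checked(bad_command))
```

A zero return code only says the batch ran without an error sqlcmd cared about; it does not say the rows are the ones you wanted.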



The output looks something like this (sorry about the small font):



<. . . sqlcmd messages . . .>



"KEY003","hakunamatadacopper","good",28776.5,X.XXXXX,X.XXXXX,X.XXXXX,X.XXXXXX,XX.XXXX,X.XXXXX
"KEY005","tembomalachite","not as good",25855.9,X.XXXXX,X.XXXXXX,X.XXXXX,X.XXXXXX,XX.XXXX,X.XXXXX
"KEY006","simbacobalt","not as good",156767,X.XXXXXX,X.XXXXXXX,X.XXXXXX,X.XXXXXXX,XX.XXXX,X.XXXXXX
"KEY010","jambocobalt","good",488977,X.XXXXX,X.XXXXXX,X.XXXX,X.XXXXXX,XXX.XXX,X.XXXXX
"KEY015","cucoagogo","good",39576.7,X.XXXX,X.XXXXXX,X.XXXXX,X.XXXXXX,XX.XXXX,X.XXXXX
"KEY016","greenrock","good",160,X.XXX,X.XXX,X.XXX,X.XXX,XXX.XX,X.XX
"KEY033","pinkrock","not as good",81504.3,X.XXXXX,X.XXXXXX,X.XXXXX,X.XXXXXX,XXX.XXX,X.XXXX
"KEY006","funkyleach","not as good",55866.1,X.XXXXXX,X.XXXXXX,X.XXXXXX,X.XXXXXX,XXX.XXX,X.XXXXXX
"KEY010","metalhome","good",30301.1,X.XXXXX,X.XXXXXX,X.XXXXX,X.XXXXXX,XXX.XX,X.XXXXX
"KEY015","boulderpile","good",2878.25,X.XX,X.XX,X.XXX,X.XXX,XX.XXX,X.XXX
"KEY033","berm","not as good",5309.97,X.XXXXX,X.XXXXXX,X.XXXXX,X.XXXXXX,XXX.XXX,X.XXXXX

(11 rows affected)



I've given my stockpiles funny names and X'ed out the numeric grades to sanitize this, but you get the general idea.



Now, finally to some Python code.  I'll get the lines of the file (faux csv) I want and parse them with the csv module reader object.  The whole deal is kind of verbose (I have a collections.namedtuple object that takes each "column" as an attribute).  I'm only going to show the part that segregates the lines I want and reads them with the csv reader.  The wpx module has all of my constants and static data definition in it.  Some of the whitespace issues I still need to work out.  For now I brute force stripped off leading and trailing whitespace from values.



def parsesqlcmdoutput():
    """
    Parse output from sqlcmd.

    Returns list of
    collections.namedtuple
    objects.
    """
    lines = []
    with open(wpx.OUTPUTFILE +
              wpx.CSVEXT, 'r') as f:
        # Get relevant lines.
        # Rip whitespace off end - excessive.
        # XXX - string overloading - hack.
        lines = [linex.strip() for
                 linex in f if
                 linex[0:wpx.STKFLAG[0]] ==
                         wpx.STKFLAG[1]]
    rdr = csv.reader(lines, quoting =
                            csv.QUOTE_NONNUMERIC)
    records = []
    for r in rdr:
        # Get rid of whitespace padding
        # around string values.
        # xrange is Python 2; use range in Python 3.
        for x in xrange(wpx.IHSTRIDX):
            r[x] = r[x].strip()
        records.append(wpx.INPUTRECORD(*r))
    return records

That csv.QUOTE_NONNUMERIC (quote non-numeric) setting is handy.  As per the Python doc, the reader converts anything that isn't quoted to a float.  As long as my data are clean, I should be good there and it strips out some cruft code-wise.



The list built by the comprehension is an iterable of strings the same way a file is, so the csv module's reader works fine on it.
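Here is a stripped down, self-contained version of the same idea.  The raw lines are invented (the numbers are placeholders, not real grades), and the filter simply assumes that every data row starts with a double quote, which stands in for the wpx.STKFLAG test above.

```python
import csv

# Made-up sqlcmd style output: message lines, data rows, row count.
raw_lines = [
    'some sqlcmd message\n',
    '\n',
    '"KEY016","greenrock","good",160,1.25,0.5\n',
    '"KEY033","pink rock","not as good",81504.3,2.75,0.125\n',
    '\n',
    '(2 rows affected)\n',
]

# Keep only the data rows (assumption: they start with a double quote).
wanted = [line.strip() for line in raw_lines if line.startswith('"')]

# A plain list of strings is iterable, so csv.reader accepts it.
# QUOTE_NONNUMERIC turns every unquoted field into a float.
records = list(csv.reader(wanted, quoting=csv.QUOTE_NONNUMERIC))

for record in records:
    print(record)
```

The quoted fields come back as strings (spaces and all) and the unquoted ones as floats, so there is no float() call per column in my own code.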



That's about it (minus a lot of background code - if you need that, let me know and I'll put it in the comments).



Thanks for stopping by.


 


Using Generators and Coroutines to Merge Tabular Data (Drill Holes)
2016-07-10T10:46:00.000-07:00
I have some mining drill hole data that I need to merge into an old vendor FORTRAN input format.  Basically I do a series of SQL pulls from the drillhole database to csv files, then merge the data.  My methodology has been a bit brute force in matching the separate parts of the drill hole data (lists, opening and closing of files to find matching holes, etc.).  My thought was that I could do this more elegantly and efficiently by iterating through the files with generators.



The ability of generators to communicate with each other via the send() method intrigued me.  I had always been a bit shy about using this language feature.  My csv problem gave me a justification for checking it out.



The reference I used was Dr. Dave Beazley's 2009 PyCon tutorial.  He does a nice job of explaining things as well as dispensing good advice.  (I disobeyed the good advice in the interest of shoehorning coroutines into my solution; I'll cover this below.)  Beazley defines a coroutine, in the sense of generators and the "yield" keyword, as a generator where "yield" is used more generally.  That is the sense in which I'm using the word "coroutine" in this post.
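For anyone who hasn't used send() before, this is roughly what a "pure" coroutine in that sense looks like: it only consumes values and never yields data back.  This is a minimal sketch of my own with made-up names (keepmatches, matches), not code from the tutorial.

```python
def keepmatches(matches, wanted):
    """
    Coroutine: receives (key, value) pairs via send()
    and appends the values whose key equals wanted
    to the matches list.
    """
    while True:
        key, value = yield
        if key == wanted:
            matches.append(value)


matches = []
co = keepmatches(matches, 44)
# Prime the coroutine: advance it to the first yield.
next(co)
for pair in [(44, 1), (44, 2), (45, 3)]:
    co.send(pair)
co.close()

print(matches)
```

The next(co) call (or co.send(None)) is the priming step; a freshly created generator cannot accept a real value until it has run up to its first yield.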



Given my problem of a one (drill hole start survey) to many (drill hole interval values) relationship, I attempted a very simple (perhaps oversimplified) toy program demo of what I wanted to do with real data:



def coroutinex(subgenerator):
    """
    Generator function that consumes
    a key value sent from a higher
    level generator.  This generator
    yields two-tuples of the form
    (<boolean>, data).  The boolean
    value indicates whether the key
    matches the data.

    Returns a generator.
    """
    while True:
        # One entry point for send()/reset.
        keyx = yield
        subdatatop = next(subgenerator)
        if subdatatop[0] == keyx:
            yield (True, subdatatop)
            for subdataloop in subgenerator:
                if subdataloop[0] == keyx:
                    yield (True, subdataloop)
                else:
                    yield (False, subdataloop)
                    break
 

def toplevelgen(topleveliter, coroutinex):
    """
    Top level generator function.

    coroutinex is a generator
    that this generator sends
    a key value to.  It
    yields a two-tuple that
    communicates whether the
    key matches or not.

    Returns a generator.
    """
    # Get sub generator/coroutine initialized.
    coroutinex.send(None)
    # Variable for dealing with return
    # from sub-generator/coroutine.
    subvalue = False
    for keyx in topleveliter:
        yield keyx
        if subvalue:
            yield subvalue
        subvalue = coroutinex.send(keyx)
        # Get sub generator/coroutine re-initialized
        # after send() reset.
        if subvalue is None:
            # XXX - hack
            subvalue = coroutinex.send(keyx)
        yield subvalue
        for submessage in coroutinex:
            # XXX - another hack to deal with yield of None.
            if not submessage:
                continue
            subvalue = submessage
            # if submessage[0] is True, kick it out.
            if submessage[0]:
                yield submessage
            else:
                # Keep subvalue for after keyvalue
                # yield at top.
                break

topleveliter = range(44, 55)
keysx = [44, 44, 44, 45, 45, 45, 45, 45,
         46, 46, 46, 46, 46, 46, 46, 46,
         47, 47, 47, 48, 48, 48, 48, 48,
         49, 49, 49, 49, 49, 49, 50, 50,
         51, 51, 51, 51, 51, 51, 51, 51,
         52, 52, 52, 52, 52, 52, 52, 52,
         53, 53, 53, 53, 53, 53, 53, 53,
         54, 54, 54, 54, 54, 54, 54, 54]

sequencex = range(1, len(keysx) + 1)
subgenerator = zip(keysx, sequencex)

gensub = coroutinex(subgenerator)
genmain = toplevelgen(topleveliter, gensub)

for x in genmain:
    print(x)





Output:


44
(True, (44, 1))
(True, (44, 2))
(True, (44, 3))
45
(False, (45, 4))
(True, (45, 5))
(True, (45, 6))
(True, (45, 7))
(True, (45, 8))
46
(False, (46, 9))
(True, (46, 10))
(True, (46, 11))
(True, (46, 12))
(True, (46, 13))
(True, (46, 14))
(True, (46, 15))
(True, (46, 16))
47
(False, (47, 17))
(True, (47, 18))
(True, (47, 19))
48
(False, (48, 20))
(True, (48, 21))
(True, (48, 22))
(True, (48, 23))
(True, (48, 24))
49
(False, (49, 25))
(True, (49, 26))
(True, (49, 27))
(True, (49, 28))
(True, (49, 29))
(True, (49, 30))
50
(False, (50, 31))
(True, (50, 32))
51
(False, (51, 33))
(True, (51, 34))
(True, (51, 35))
(True, (51, 36))
(True, (51, 37))
(True, (51, 38))
(True, (51, 39))
(True, (51, 40))
52
(False, (52, 41))
(True, (52, 42))
(True, (52, 43))
(True, (52, 44))
(True, (52, 45))
(True, (52, 46))
(True, (52, 47))
(True, (52, 48))
53
(False, (53, 49))
(True, (53, 50))
(True, (53, 51))
(True, (53, 52))
(True, (53, 53))
(True, (53, 54))
(True, (53, 55))
(True, (53, 56))
54
(False, (54, 57))
(True, (54, 58))
(True, (54, 59))
(True, (54, 60))
(True, (54, 61))
(True, (54, 62))
(True, (54, 63))
(True, (54, 64))
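The two "XXX" hacks in toplevelgen (the second send() and the skip on a None message) come from using one generator both to receive keys and to hand back data.  The toy below isolates that behavior.  The names (mixed, 'row for') are made up for illustration.

```python
def mixed():
    """
    Generator that both receives a key (bare yield)
    and produces data (yield with a value).
    """
    while True:
        key = yield                # receive point - hands back None
        yield ('row for', key)     # produce point


gen = mixed()
primed = gen.send(None)   # runs to the bare yield
first = gen.send(44)      # key received; runs to the produce point
second = gen.send(45)     # resumes at the produce point, so 45 is
                          # thrown away and the bare yield answers
retry = gen.send(45)      # now the key is actually received

print(primed, first, second, retry)
```

A send() always resumes the generator wherever it last paused.  If that pause was at a data-producing yield, the sent value is discarded and the caller gets back whatever the next yield produces, which here is the None from the bare receive point.  Hence the re-send.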



Back to Dr. Beazley's advice: he doesn't recommend this.  Even though "yield" is the keyword in both cases, it means two different things in the two contexts, so do not mix generator and coroutine functionality.  I'm going ahead in this post and doing it anyway.  I don't have an excuse.  It does remind me of some old Bob Dylan lyrics:




Now the rainman gave me two cures
Then he said, "Jump right in"
The one was Texas medicine
The other was just railroad gin
An' like a fool I mixed them
An' it strangled up my mind






It's OK, Bob, some of us just need to learn things the hard way.
 

Onward.






A brief diversion on drill holes - the data for a small scale (about 2,000 feet or less) geotechnical or geologic drill hole come back in three parts:


1) collar - where the hole starts in space (coordinates).






2) surveys - where the hole ends up going in space relative to the collar (drill pipe has proven to be amazingly flexible when passing through rock).






3) assays - usually the hole is sampled along intervals and chemically or physically analyzed.  The assay intervals may or may not coincide with survey intervals.




Clear as (drilling) mud?  Great - back to Python.






The problem:






Three tabular csv dumps from SQL - a collar file, a survey file, and an assay file.  Each has a unique key in the first column that matches across files (the drill hole key).  On the SQL side I have ensured that there are no orphan key rows in any of the three files and that all three are sorted on the key.
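Because all three files are sorted on the key and have no orphans, the one-to-many matching can also be written with itertools.groupby.  To be clear, this is not the approach I take in this post (I wanted to try send()); it is just a minimal comparison sketch with made-up rows and a made-up function name, mergeholes.

```python
from itertools import groupby
from operator import itemgetter


def mergeholes(collars, surveys, assays):
    """
    Yield (collar, survey rows, assay rows) per drill hole.

    Assumes all three inputs are sorted on the key in the
    first column and that every collar key appears in both
    the survey and the assay data.
    """
    surveygroups = groupby(surveys, key=itemgetter(0))
    assaygroups = groupby(assays, key=itemgetter(0))
    for collar, (skey, srows), (akey, arows) in zip(collars,
                                                    surveygroups,
                                                    assaygroups):
        if not (collar[0] == skey == akey):
            raise ValueError('keys out of step: {0}'.format(collar[0]))
        # Consume each group before zip advances to the next key.
        yield collar, list(srows), list(arows)


# Illustrative rows only: (key, payload).
collars = [(44, 'collar A'), (45, 'collar B')]
surveys = [(44, 's1'), (44, 's2'), (45, 's3')]
assays = [(44, 'a1'), (45, 'a2'), (45, 'a3')]

merged = list(mergeholes(collars, surveys, assays))
for hole in merged:
    print(hole)
```

This leans entirely on the sorting and no-orphan guarantees from the SQL side; the ValueError is there to fail loudly if they don't hold.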






I present the sanitized output here first - it will give some context to the domain specific parts of the code:




XXXXX,XXXXXX.XXXX,XXXXXXX.XXXX,XXXX.XXXX,0.0000,0.0000,26.4529
XXXXX,0.0000,1.1925,1.1925,283.5688,-13.5310  
XXXXX,1.1925,4.2760,3.0836,284.6224,1.9328      SURVEYS
XXXXX,4.2760,6.3799,2.1039,280.2829,-3.1334       GO
XXXXX,6.3799,9.7024,3.3225,282.5794,2.3632       HERE
XXXXX,9.7024,11.8701,2.1677,285.4406,-1.1631     AFTER
XXXXX,11.8701,13.6920,1.8219,275.9462,-5.0698    COLLAR
XXXXX,13.6920,17.1199,3.4279,285.4561,1.9560    LOCATION
XXXXX,17.1199,19.6944,2.5746,279.2318,-0.7344
XXXXX,19.6944,22.5857,2.8913,282.1947,4.3241
XXXXX,22.5857,24.1879,1.6022,283.8367,-1.7525
XXXXX,24.1879,26.4529,2.2650,287.3820,13.4805
XXXXX                             <----- LEGACY DRILLHOLE NUMBER
XXXXX,X.XXXX,X.XXXX,X.XXXX,X.XX,X.XX,X.XX, etc.
XXXXX,X.XXXX,X.XXXX,X.XXXX,X.XX,X.XX,X.XX, etc.
XXXXX,X.XXXX,X.XXXX,X.XXXX,X.XX,X.XX,X.XX, etc.            ASSAYS
XXXXX,X.XXXX,X.XXXX,X.XXXX,X.XX,X.XX,X.XX, etc.              GO
XXXXX,X.XXXX,XX.XXXX,X.XXXX,X.XX,X.XX,X.XX, etc.            HERE
XXXXX,XX.XXXX,XX.XXXX,X.XXXX,X.XX,X.XX,X.XX,XX.XX, etc.
XXXXX,XX.XXXX,XX.XXXX,X.XXXX,X.XX,X.XX,X.XX,XX.XX, etc.
XXXXX,XX.XXXX,XX.XXXX,X.XXXX,X.XX,X.XX,X.XX,XX.XX, etc.
XXXXX,XX.XXXX,XX.XXXX,X.XXXX,X.XX,X.XX,X.XX,XX.XX, etc.
XXXXX,XX.XXXX,XX.XXXX,X.XXXX,X.XX,X.XX,X.XX,XX.XX, etc.
XXXXX,XX.XXXX,XX.XXXX,X.XXXX,X.XX,X.XX,X.XX,XX.XX, etc.
                                                <----- BLANK LINE
XXXXXX,XXXXXX.XXXX,XXXXXXX.XXXX,XXXX.XXXX,0.0000,0.0000,23.5411
XXXXXX,0.0000,2.5781,2.5781,135.0157,2.3341
XXXXXX,2.5781,5.0351,2.4570,137.1873,5.5353
XXXXXX,5.0351,7.3706,2.3354,135.2276,7.7020
XXXXXX,7.3706,9.9168,2.5462,136.4253,6.4493
                .
                .
                .
                .
                .
                .
                .
               etc.






And the code (sorry about the size - it got messier than I would have hoped):






#!C:\Python35\python

"""
Parse collar, survey, and assay dumps for
trenches from vendor drill hole RDBMS.

Write specially formatted data file for
consumption by old vendor FORTRAN
routine 201.
"""

import csv
from collections import namedtuple
from collections import OrderedDict

COLLAR = './data/collar.csv'
SURVEY = './data/survey.csv'
ASSAYS = './data/assays.csv'
DAT201 = './data/TR.dat'

# collar (ssit) fields
ID = 'drillholeid'
NAME = 'drillholename'
DATE = 'drillholedate'
LEGACY = 'drillholehistoricname'
X = 'collarx'
Y = 'collary'
Z = 'collarz'
AZ = 'azimuth'
DIP = 'dip'
LEN = 'drillholelength'

COLLARFIELDS = [ID, NAME, DATE, LEGACY, X, Y, Z,
                AZ, DIP, LEN]

# survey fields
FROM = 'fromx'
TO = 'depthto'
SAMPLEN = 'surveylength'
AZ = 'azimuth'
DIP = 'dip'

SURVEYFIELDS = [ID, NAME, DATE, LEGACY, FROM, TO,
                SAMPLEN, AZ, DIP]

# assay fields
AFROM = 'assayfrom'
ATO = 'assayto'
AI = 'assayinterval'
ASSAY1 = 'assay1'
ASSAY2 = 'assay2'
ASSAY3 = 'assay3'
ASSAY4 = 'assay4'
ASSAY5 = 'assay5'
ASSAY6 = 'assay6'
ASSAY7 = 'assay7'
ASSAY8 = 'assay8'

ASSAYFIELDS = [ID, NAME, LEGACY, AFROM, ATO, AI, ASSAY1,
               ASSAY2, ASSAY3, ASSAY4, ASSAY5, ASSAY6, ASSAY7, ASSAY8]

ASSAYFORMAT = '.2f'
SURVEYFORMAT = '.4f'

COMMA = ','

# Output for 201 file format.
# Collars.
COLOUTPUTCOLS = [X, Y, Z, AZ, DIP, LEN]
COLFMTOUTPUT = [(attribx, SURVEYFORMAT) for attribx in COLOUTPUTCOLS]
# Surveys.
SURVOUTPUTCOLS = [FROM, TO, SAMPLEN, AZ, DIP]
SURVFMTOUTPUT = [(attribx, SURVEYFORMAT) for attribx in SURVOUTPUTCOLS]
# Assays.
ASSYOUTPUTCOLS = [AFROM, ATO, AI, ASSAY1, ASSAY2, ASSAY3, ASSAY4, ASSAY5,
                  ASSAY6, ASSAY7, ASSAY8]
ASSYOUTPUTFMTS = 3 * [SURVEYFORMAT] + 8 * [ASSAYFORMAT]
# Have to use this repeatedly - hence list.
ASSYFMTOUTPUT = list(zip(ASSYOUTPUTCOLS, ASSYOUTPUTFMTS))

RETCHAR = '\n'

# For tracking which dataset we're
# dealing with.
SURVEYSUBDATA = 'survey'
ASSAYSUBDATA = 'assay'

# For survey/assay dictionary.
COR = 'coroutine'
FMT = 'format'
LAST = 'lastvalue'
END = 'end'

INFOMESSAGE = 'Now doing hole number {0} . . .'

def makecsvdatagenerator(csvrdr, ntname, ntfields):
    """
    Returns a generator that yields csv
    row records as named tuple objects.

    csvrdr is the csv.reader object. 

    ntname is the name given to the
    collections.namedtuple object.

    ntfields is the list of field names
    for the collections.namedtuple object. 
    """
    namedtup = namedtuple(ntname, ntfields)
    return (namedtup(*linex) for linex in csvrdr)

def formatassay(numstring, formatx):
    """
    Returns a string representing a float
    that typically is in 0.00 format, but
    other float formats can be applied.

    numstring is a string representing a float.

    formatx is the desired format (Python 3 format string).
    """
    return format(float(numstring), formatx)

def getnumericstrings(record, formats):
    """
    Returns list of strings.

    record is a collections.namedtuple instance.

    formats is a list of two-tuples of namedtuple
    attributes and numeric string formats to be
    applied to each attribute's value.
    """
    return [formatassay(getattr(record, pairx[0]), pairx[1])
            for pairx in formats]

def coroutinex(subgenerator):
    """
    Generator function.
  
    Consumes key value and yields
    two tuple of (<boolean>,
    next(subgenerator)) in response.
    boolean value indicates
    whether key matches first
    value of subgenerator namedtuple.

    subgenerator is a generator of
    namedtuples.

    Returns a generator.
    """
    while True:
        keyx = yield
        subdatatop = next(subgenerator)
        if subdatatop.drillholeid == keyx:
            yield (True, subdatatop)
            for subdataloop in subgenerator:
                if subdataloop.drillholeid == keyx:
                    yield (True, subdataloop)
                else:
                    yield (False, subdataloop)
                    break
        # Case where only one interval in
        # drill hole.
        else:
            yield (False, subdatatop)

def formatdataline(record, formats):
    """
    Prepare record as a line
    of text for write to file.

    record is a collections.namedtuple
    object.

    formats is a list of two tuples of
    namedtuple attributes and numeric
    string formats.

    Returns string.
    """
    recordline = [record.drillholehistoricname]
    recordline.extend(getnumericstrings(record,
                                        formats))
    return COMMA.join(recordline) + RETCHAR

def dealwithsend(subgen, sendval):
    """
    Helper function to clean up code.
    Deals with initial receipt of
    None value upon send() and
    re-sends value.

    Sends value sendval to
    generator/coroutine subgen.

    Returns two tuple of (<boolean>,
    <collections.namedtuple>).
    """
    retval = subgen.send(sendval)
    if retval is None:
        retval = subgen.send(sendval)
    return retval

def dealwithyieldrecord(survassay, subdata):
    """
    Helper function to clean up code.

    Formats values for write to file.

    survassay is a dictionary of values.

    subdata is the dictionary key that
    tells which data is being handled
    (survey or assay).
    """
    return formatdataline(survassay[subdata][LAST][1],
                          survassay[subdata][FMT])

def cyclecollars(collargen,
                 survassay):
    """
    Generator function that yields
    data (strings) for write to a
    specially formatted drill hole
    file.

    This is the top level generator
    for working the merging of 
    drillhole data (collars, surveys,
    assays).

    survassay is a collections.OrderedDict
    object that references the respective
    survey and assay generators and holds
    information for tracking which subset
    of data (surveys or assays) are being
    worked.
    """
    for record in collargen:
        keyx = record.drillholeid
        label = record.drillholehistoricname
        survassay[SURVEYSUBDATA][END] = label + RETCHAR
        print(INFOMESSAGE.format(label))
        yield formatdataline(record, COLFMTOUTPUT)
        for subdata in survassay:
            fmt = survassay[subdata][FMT]
            if survassay[subdata][LAST]:
                yield dealwithyieldrecord(survassay, subdata)
            subvalue = dealwithsend(survassay[subdata][COR], keyx)
            # Case where only one interval.
            if not subvalue[0]:
                survassay[subdata][LAST] = subvalue
                yield survassay[subdata][END]
                continue
            yield formatdataline(subvalue[1], fmt)
            for submessage in survassay[subdata][COR]:
                # End of iteration.
                if submessage is None:
                    yield survassay[subdata][END]
                    break
                if submessage[0]:
                    yield formatdataline(submessage[1], fmt)
                else:
                    survassay[subdata][LAST] = submessage
                    yield survassay[subdata][END]
                    break

def main():
    """
    Parse csv dumps from SQL and write
    drillhole data fields for import
    to old vendor FORTRAN based binary
    files.

    Side effect function.
    """
    with open(COLLAR, 'r') as colx:
        colcsv = csv.reader(colx)
        collargen = makecsvdatagenerator(colcsv,
                                         'collars',
                                         COLLARFIELDS)
        with open(SURVEY, 'r') as svgx:
            survcsv = csv.reader(svgx)
            survgen = makecsvdatagenerator(survcsv,
                                           'surveys',
                                           SURVEYFIELDS)
            surveycoroutinex = coroutinex(survgen)
            with open(ASSAYS, 'r') as assx:
                assycsv = csv.reader(assx)
                assygen = makecsvdatagenerator(assycsv,
                                               'assays',
                                               ASSAYFIELDS)
                assaycoroutinex = coroutinex(assygen)
                with open(DAT201, 'w') as d201:
                    # Get sub generators/coroutines initialized.
                    surveycoroutinex.send(None)
                    assaycoroutinex.send(None)
                    surveyassay = OrderedDict()
                    surveyassay[SURVEYSUBDATA] = {COR:surveycoroutinex,
                                                  FMT:SURVFMTOUTPUT,
                                                  LAST:None,
                                                  END:None}
                    surveyassay[ASSAYSUBDATA] = {COR:assaycoroutinex,
                                                 FMT:ASSYFMTOUTPUT,
                                                 LAST:None,
                                                 END:RETCHAR}
                    colgenx = cyclecollars(collargen,
                                           surveyassay)
                    for linex in colgenx:
                        d201.write(linex)
    print('Done')

if __name__ == '__main__':
    main() 






The bad news: this was more difficult with a real world dataset than I anticipated.  Beazley's admonition was an apt one.

  



The good news:  it does perform better than my previous brute force implementations.  From the standpoint of iterating through datasets and not wasting resources (even with the polling or interrupting or whatever facilitates the generator communication closer to the metal), this is a better implementation.  Also, I learned a bit more about the "yield" keyword.

Thanks for stopping by.  


 


















  



7-Zip-JBinding API with jython on Windows
2016-04-18T09:52:00.000-07:00
I have a set of multi-GB Windows folders that I need to archive in 7-zip format each month.  I'd prefer not to use the mouse to compress the folders "manually."  Also, I didn't want to use the command line with the subprocess module like I have with some other programs.  Ideally, I wanted to control 7zip programmatically.  The 7-Zip-JBinding libraries offered a means to do this from jython.



7-Zip-JBinding is written using Java interfaces that are structured pretty specifically.  I did not venture too far away from the examples given in the 7-Zip-JBinding documentation.  I smithed two modules for my own purposes, one for compressing and one for uncompressing, and present them (Java code) below.  The decompression one has a separate method for retrieving paths of the compressed files.  This is not efficient, but for what I need to do, and for the limitations of the library and the approach, it works out for the best.



import java.io.IOException;
import java.io.RandomAccessFile;

import net.sf.sevenzipjbinding.IOutCreateArchive7z;
import net.sf.sevenzipjbinding.IOutCreateCallback;
import net.sf.sevenzipjbinding.IOutItem7z;
import net.sf.sevenzipjbinding.ISequentialInStream;
import net.sf.sevenzipjbinding.SevenZip;
import net.sf.sevenzipjbinding.SevenZipException;
import net.sf.sevenzipjbinding.impl.OutItemFactory;
import net.sf.sevenzipjbinding.impl.RandomAccessFileOutStream;
import net.sf.sevenzipjbinding.util.ByteArrayStream;



/* Off StackOverflow - works for getting
 * file content/bytes from path */
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.Path;



public class SevenZipThing {



    private static final String RETCHAR = "\n";
    private static final String INTFMT = "%,d";
    private static final String BYTESTOCOMPRESS = " bytes total to compress\n";
    private static final String ERROCCURS = "Error occurs: ";
    private static final String COMPRESSFILE = "\nCompressing file ";
    private static final String RW = "rw";
    private static final int LVL = 5;
    private static final String SEVZERR = "7z-Error occurs:";
    private static final String ERRCLOSING = "Error closing archive: ";
    private static final String ERRCLOSINGFLE = "Error closing file: ";
    private static final String SUCCESS = "\nCompression operation succeeded\n";



    private String filename;

    /* String[] array conversion from jython list
     * implicit and poses no problems (JDK7) */
    private String[] pathsx;



    public SevenZipThing(String filename, String[] pathsx) {
        this.filename = filename;
        this.pathsx = pathsx;
    }



    /**
     * The callback provides information about archive items.
     *
     * I copied this straight from the 7-Zip-JBinding author's
     * code - but I haven't put much in to deal with messaging
     * or error handling.
     */
    private final class MyCreateCallback 
            implements IOutCreateCallback<IOutItem7z> {



        public void setOperationResult(boolean operationResultOk)
                throws SevenZipException {
            // Track each operation result here
        }



        public void setTotal(long total) throws SevenZipException {
            // Track operation progress here
    
            System.out.print(RETCHAR + String.format(INTFMT, total) +
                     BYTESTOCOMPRESS);
        }



        public void setCompleted(long complete) throws SevenZipException {
            // Track operation progress here
        }



        public IOutItem7z getItemInformation(int index,
                OutItemFactory<IOutItem7z> outItemFactory) {

            IOutItem7z item = outItemFactory.createOutItem();

            Path path = Paths.get(pathsx[index]);
            item.setPropertyPath(pathsx[index]);
            try {
                // Java arrays are limited to 2 ** 31 items - small.
                byte[] data = Files.readAllBytes(path);
                item.setDataSize((long) data.length);
                return item;
            // XXX - I could do a lot better than this (error handling).
            } catch (Exception e) {
                System.err.println(ERROCCURS + e);
            }
            return null;
        }



        public ISequentialInStream getStream(int i)
            throws SevenZipException {

            Path path = Paths.get(pathsx[i]);

            try {
                byte[] data = Files.readAllBytes(path);
                System.out.println(COMPRESSFILE + path);
                return new ByteArrayStream(data, true);
            } catch (Exception e) {
                System.err.println(ERROCCURS + e);
            }
            return null;
        }
    }



    public void compress() {
        
        /* Mostly copied from sevenZipJBinding's author's code -
         * I made the compress method public to work from jython.
         * Also, I deal with all of the file listing in jython
         * and just pass a list to this class. */

        boolean success = false;
        RandomAccessFile raf = null;
        IOutCreateArchive7z outArchive = null;
        try {
            raf = new RandomAccessFile(filename, RW);

            // Open out-archive object
            outArchive = SevenZip.openOutArchive7z();

            // Configure archive
            outArchive.setLevel(LVL);
            outArchive.setSolid(true);

            // All available processors.
            outArchive.setThreadCount(0);

            // Create archive
            outArchive.createArchive(new RandomAccessFileOutStream(raf),
                    pathsx.length, new MyCreateCallback());
            success = true;
        } catch (SevenZipException e) {
            System.err.println(SEVZERR);
            // Get more information using extended method
            e.printStackTraceExtended();
        } catch (Exception e) {
            System.err.println(ERROCCURS + e);
        } finally {
            if (outArchive != null) {
                try {
                    outArchive.close();
                } catch (IOException e) {
                    System.err.println(ERRCLOSING + e);
                    success = false;
                }
            }
            if (raf != null) {
                try {
                    raf.close();
                } catch (IOException e) {
                    System.err.println(ERRCLOSINGFLE + e);
                    success = false;
                }
            }
        }
        if (success) {
            System.out.println(SUCCESS);
        }
    }
}



import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.File;
import java.io.OutputStream;
import java.io.FileOutputStream;
import java.io.FileNotFoundException;

import java.util.Arrays;
import java.util.ArrayList;

import net.sf.sevenzipjbinding.IInArchive;
import net.sf.sevenzipjbinding.PropID;
import net.sf.sevenzipjbinding.SevenZip;
import net.sf.sevenzipjbinding.SevenZipException;
import net.sf.sevenzipjbinding.impl.RandomAccessFileInStream;
import net.sf.sevenzipjbinding.IArchiveExtractCallback;
import net.sf.sevenzipjbinding.ExtractOperationResult;
import net.sf.sevenzipjbinding.ExtractAskMode;
import net.sf.sevenzipjbinding.ISequentialOutStream;

/* 7z archive format */
/* SEVEN_ZIP is the one I want */
import net.sf.sevenzipjbinding.ArchiveFormat;



public class SevenZipThingExtract {

    private String filename;
    private String extractdirectory;
    private ArrayList<String> foldersx = null;
    private boolean subdirectory = false;

    private static final String ERROPENINGFLE = "Error opening file: ";
    private static final String ERRWRITINGFLE = "Error writing to file: ";
    private static final String EXTERR = "Extraction error";
    private static final String INFOFMT = "%9X | %10s | %s";
    private static final String RETCHAR = "\n";
    private static final String INTFMT = "%,d";
    private static final String BYTESTOEXTRACT = " bytes total to extract\n";
    private static final String RW = "rw";
    private static final String BACKSLASH = "\\";
    private static final String SEVZERR = "7z-Error occurs:";
    private static final String ERROCCURS = "Error occurs: ";
    private static final String ERRCLOSING = "Error closing archive: ";
    private static final String ERRCLOSINGFLE = "Error closing file: ";



    public SevenZipThingExtract(String filename, String extractdirectory,
                                boolean subdirectory) {
        this.filename = filename;
        this.foldersx = new ArrayList<String>();
        this.extractdirectory = extractdirectory;
        this.subdirectory = subdirectory;
    }



    private final class MyExtractCallback 
            implements IArchiveExtractCallback {

        // Copied mostly from example.
        private int hash = 0;
        private int size = 0;
        private int index;
        private boolean skipExtraction;
        private IInArchive inArchive;

        private OutputStream outputStream;
        private File file;



        public MyExtractCallback(IInArchive inArchive) {
            this.inArchive = inArchive;
        }



        @Override
        public ISequentialOutStream getStream(int index,
                          ExtractAskMode extractAskMode)
                throws SevenZipException {


            this.index = index;
            // I'm not skipping anything.
            skipExtraction = false;

            String path = (String) inArchive.getProperty(index, PropID.PATH);
            // Try prepending extractdirectory.
            if (subdirectory) {
                path = extractdirectory + BACKSLASH + path.substring(2);
            } else {
                path = extractdirectory + BACKSLASH + path;
            }
            file = new File(path);

            try {
                outputStream = new FileOutputStream(file);
            } catch (FileNotFoundException e) {
                throw new SevenZipException(ERROPENINGFLE
                        + file.getAbsolutePath(), e);
            }
            return new ISequentialOutStream() {
                public int write(byte[] data) throws SevenZipException {
                   try {
                       outputStream.write(data);
                   } catch (IOException e) {
                       throw new SevenZipException(ERRWRITINGFLE
                               + file.getAbsolutePath(), e);
                   }
                   return data.length; // Return amount of consumed data
                }
            };
       }



        public void prepareOperation(ExtractAskMode extractAskMode)
                throws SevenZipException {
        }

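        public void setOperationResult(ExtractOperationResult extractOperationResult)
                throws SevenZipException {
            // Close the file written for this item so its handle is released.
            if (outputStream != null) {
                try {
                    outputStream.close();
                } catch (IOException e) {
                    throw new SevenZipException(ERRCLOSINGFLE
                            + file.getAbsolutePath(), e);
                }
                outputStream = null;
            }
            // Track each operation result here
            if (extractOperationResult != ExtractOperationResult.OK) {
                System.err.println(EXTERR);
            } else {
                System.out.println(String.format(INFOFMT, hash, size,
                        inArchive.getProperty(index, PropID.PATH)));
                hash = 0;
                size = 0;
            }
        }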



        public void setTotal(long total) throws SevenZipException {
            System.out.print(RETCHAR + String.format(INTFMT, total) +
                             BYTESTOEXTRACT);
        }



        public void setCompleted(long complete) throws SevenZipException {
            // Track operation progress here
        }
    }



    private final class MyGetPathsCallback 
            implements IArchiveExtractCallback {

        // Copied mostly from example.
        private int hash = 0;
        private int size = 0;
        private int index;
        private boolean skipExtraction;
        private IInArchive inArchive;

        public MyGetPathsCallback(IInArchive inArchive) {
            this.inArchive = inArchive;
        }

        public ISequentialOutStream getStream(int index,
            ExtractAskMode extractAskMode)
                throws SevenZipException {
             this.index = index;
             // I'm not skipping anything.
             skipExtraction = (Boolean) false;

             String path = (String) inArchive.getProperty(index,
                 PropID.PATH);
             foldersx.add(path);

             return new ISequentialOutStream() {
                public int write(byte[] data) throws SevenZipException {
                    hash ^= Arrays.hashCode(data);
                    size += data.length;
                    // Return amount of processed data
                    return data.length;
                }
            };
        }



        public void prepareOperation(ExtractAskMode extractAskMode)
                throws SevenZipException {
        }



        public void setOperationResult(ExtractOperationResult extractOperationResult)
                throws SevenZipException {
            // Track each operation result here
            if (extractOperationResult != ExtractOperationResult.OK) {
                System.err.println(EXTERR);
            } else {
                System.out.println(String.format(INFOFMT, hash, size,
                        inArchive.getProperty(index, PropID.PATH)));
                hash = 0;
                size = 0;
            }
        }



        public void setTotal(long total) throws SevenZipException {
            System.out.print(RETCHAR + String.format(INTFMT, total) +
                BYTESTOEXTRACT);
        }



        public void setCompleted(long complete) throws SevenZipException {
            // Track operation progress here
        }
    }



    public void extractfiles() {
        
        boolean success = false;
        RandomAccessFile raf = null;
        IInArchive inArchive = null;
        try {
            raf = new RandomAccessFile(filename, RW);

            inArchive = SevenZip.openInArchive(ArchiveFormat.SEVEN_ZIP, 
                    new RandomAccessFileInStream(raf));

            int itemCount = inArchive.getNumberOfItems();
            
            // From StackOverflow - could use IntStream,
            // but that's Java 1.8 (using 1.7).
            int[] fileindices = new int[itemCount];
            for(int k = 0; k < fileindices.length; k++)
                fileindices[k] = k;
            inArchive.extract(fileindices, false,
                new MyExtractCallback(inArchive));
        } catch (SevenZipException e) {
            System.err.println(SEVZERR);
            // Get more information using extended method
            e.printStackTraceExtended();
        } catch (Exception e) {
            System.err.println(ERROCCURS + e);
        } finally {
            if (inArchive != null) {
                try {
                    inArchive.close();
                } catch (IOException e) {
                    System.err.println(ERRCLOSING + e);
                }
            }
            if (raf != null) {
                try {
                    raf.close();
                } catch (IOException e) {
                    System.err.println(ERRCLOSINGFLE + e);
                }
            }
        }
    }



    public ArrayList<String> getfolders() {
        
        boolean success = false;
        RandomAccessFile raf = null;
        IInArchive inArchive = null;

        try {
            raf = new RandomAccessFile(filename, RW);

            inArchive = SevenZip.openInArchive(ArchiveFormat.SEVEN_ZIP, 
                    new RandomAccessFileInStream(raf));

            int itemCount = inArchive.getNumberOfItems();
            
            // From StackOverflow - could use IntStream,
            // but that's Java 1.8 (using 1.7).
            int[] fileindices = new int[itemCount];
            for(int k = 0; k < fileindices.length; k++)
                fileindices[k] = k;
            inArchive.extract(fileindices, false,
                new MyGetPathsCallback(inArchive));
        } catch (SevenZipException e) {
            System.err.println(SEVZERR);
            // Get more information using extended method
            e.printStackTraceExtended();
        } catch (Exception e) {
            System.err.println(ERROCCURS + e);
        } finally {
            if (inArchive != null) {
                try {
                    inArchive.close();
                } catch (IOException e) {
                    System.err.println(ERRCLOSING + e);
                }
            }
            if (raf != null) {
                try {
                    raf.close();
                } catch (IOException e) {
                    System.err.println(ERRCLOSINGFLE + e);
                }
            }
        }
        return foldersx;
    }
}



The getfolders method in the SevenZipThingExtract class is the extra method that gets the list of folders.  As noted in the jython code below, the limits on the number of bytes and files that can be compressed necessitate splitting larger files into chunks.  Also, for my specific use case, I need to extract files to a specific folder and set of subfolders.  My methodology is outlined in the comments in the jython code.  The good news:  if I get run over by a bus and the uncompression part of the program gets lost, people will be able to get the files back with some effort.  The bad news:  they will be cursing my headstone.  You do the best you can.
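As a rough illustration of why the list of paths is gathered before anything is extracted, here is a minimal sketch in plain Python of turning archive entry paths into the folders that have to exist first. The function name and the sample paths are made up for the example; only the '.\' prefix convention and the _msresources/06SOLIDS folder come from the code in this post.

```python
# Sketch only: directories_from_paths and the sample paths are hypothetical.
SAMEFOLDERWIN = '.\\'
BACKSLASH = '\\'


def directories_from_paths(archivepaths):
    """
    Return a list of directory tuples, parents before
    children, for the given archive entry paths.
    """
    found = set()
    for pathx in archivepaths:
        # Drop a leading '.\' if present.
        if pathx.startswith(SAMEFOLDERWIN):
            pathx = pathx[len(SAMEFOLDERWIN):]
        # Everything but the last element is a folder name.
        parts = pathx.split(BACKSLASH)[:-1]
        # Record every prefix so parent folders are included too.
        for depth in range(1, len(parts) + 1):
            found.add(tuple(parts[:depth]))
    # Shorter tuples (parents) sort ahead of their children.
    return sorted(found, key=lambda x: (len(x), x))


paths = ['.\\_msresources\\06SOLIDS\\pit.msr',
         '.\\_msresources\\notes.txt',
         'toplevel.dat']
dirs = directories_from_paths(paths)
for dirx in dirs:
    print('/'.join(dirx))
```

Creating the folders in that order means a parent always exists before os.mkdir is asked to make one of its children.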

There are three jython modules.  The first one, folderstozip.py, is just constants:


#!java -jar C:\jython-2.7.0\jython.jar



# folderstozip.py



"""
Constants used in compression and
decompression.
"""



FRONTSLASH = '/'
BACKSLASH = '\\'
EMPTY = ''
SAMEFOLDER = './'
SAMEFOLDERWIN = u'.\\'

SPLITFILETRACKER = 'SPLITFILETRACKER.csv'

SPLITFILE = '{0:s}.{1:s}'

UCOMMA = u','



# 3rd party sevenZipJBindings library.
PATH7ZJB = 'C:/MSPROJECTS/EOMReconciliation/2016/03March'
PATH7ZJB += '/Backup/sevenzipjbinding/lib/sevenzipjbinding.jar'



# OS specific 3rd party sevenZipJBindings library.
PATH7ZJBOSSPEC = r'C:/MSPROJECTS/EOMReconciliation/2016/03March'
PATH7ZJBOSSPEC += '/Backup/sevenzipjbinding/lib/sevenzipjbinding-Windows-amd64.jar'



PROGFOLDER = 'C:/MSPROJECTS/EOMReconciliation/2016/03March/Backup'
PROGFOLDER += FRONTSLASH



# Informational messages.
WROTEFILE = 'Wrote file {:s}\n'
SPLITFILEMSG = 'Have now split {0:,d} bytes of file {1:s} into {2:d} {3:,d} chunks.\n'
DONESPLITTING = '\nDone splitting file'
FILESAFTERSPLIT = '\n{:d} files after split'

COMPRESSING = '\nCompressing file {:s} . . .\n'
DELETING = '\nDeleting file {:s} . . .\n'
DELETINGDIR = '\nNow deleting {:s} . . .\n'



# Five digits - room for 99999 file names.
UNIQUEX = '{0:05d}'



# XXX - multiple file archives limited to
#       10KB - reason unknown - crashes jvm
#       with IInStream interface class not 
#       found.
# XXX - choked on 8700 bytes - try dropping
#       this from 9500 to 8500.
MULTFILELIMIT = 8500
HALFLIMIT = MULTFILELIMIT/2

# About 50 splits for a 3GB file.
CHUNK = 2 ** 26



# Path plus split number.
FILEN = r'{0:s}.{1:03d}'

# Path plus basefilename.
FILEB = r'{0:s}{1:s}'



# Read/Write constants.
RB = 'rb'
WB = 'wb'
W = 'w'



# Filename plus split number.
ARCHIVEX = '{0:s}/{1:s}.7z'


# multifile archive
MULTARCHIVEX = '{0:s}/archive{1:03d}.7z'
MULTFILES = '. . . multiple files'



# File categories.
# Size less than HALFLIMIT.
SMALL = 'small'
# Size greater than or equal to HALFLIMIT but
# less than or equal to CHUNK.
MEDIUM = 'medium'
# Larger than CHUNK.
LARGE = 'large'



BASEPATH = 'basepath'


FILES = 'files'



# XXX - this folder has recognizable
#       folder names within your domain
#       space - mine are open pit mining
#       area names.
BASEDIRS = ['Pit-1', 'Pit-2', 'Pit-3']



#!java -jar C:/jython-2.7.0/jython.jar



# sevenzipper.py



"""
Use java 3rd party 7-zip compression
library (sevenZipJBindings) from
jython to 7zip up MineSight project
files.
"""



import folderstozip as fld

# Need to adjust path to get necessary jar imports.
import sys

# Need for os.path
import os



# Original path of file plus split number.
SPLITFILERECORD = '{0:s},{1:03d}'



sys.path.append(fld.PATH7ZJB)
sys.path.append(fld.PATH7ZJBOSSPEC)



# java 7zip library
import SevenZipThing as z7thing



# For copying files to program
# directory and deleting the old
# ones where necessary.
import shutil

# For unique archive names.
import itertools


COUNTERX = itertools.count(0, 1)



def splitfile(originalfilepath, splitfilestrackerfile):
    """
    Split file at (string) originalfilepath
    into fld.CHUNK sized chunks and indicate
    sequence by number in new split file
    name.

    Return generator of relative file paths
    inside project folder.

    originalfilepath is the path of the
    file that needs to be split into parts.

    splitfilestrackerfile is an open file
    object used for tracking file splits
    for later retrieval.
    """
    sizeoffile = os.path.getsize(originalfilepath)
    chunks = sizeoffile/fld.CHUNK + 1
    # Counter.
    i = 1
    with open(originalfilepath, fld.RB) as f:
        while i < chunks + 1:
            with open(fld.FILEN.format(originalfilepath, i), fld.WB) as f2:
                f2.write(f.read(fld.CHUNK))
                print(fld.WROTEFILE.format(fld.FILEN.format(originalfilepath, i)))
                print(fld.SPLITFILEMSG.format(f.tell(), originalfilepath, i, fld.CHUNK))
                print >> splitfilestrackerfile, (SPLITFILERECORD.format(originalfilepath, i))
                i += 1
    print(fld.DONESPLITTING)
    print(fld.FILESAFTERSPLIT.format(i - 1))
    return (fld.FILEN.format(originalfilepath, x) for x in xrange(1, i))



def movefiles(movefilesx, intermediatepath):
    """
    Move files from MineSight project directory
    to program directory.

    Return a list of base file names for the
    moved files.

    movefilesx is a generator of file paths.

    intermediatepath is a string relative path
    between the program folder and the sub-folder
    of the MineSight directory (_msresources/06SOLIDS,
    for example).
    """
    # Move files to that folder.
    movedfiles = []
    for pathx in movefilesx:
        shutil.move(pathx, fld.PROGFOLDER + intermediatepath +
                    os.path.basename(pathx))
        movedfiles.append(intermediatepath + os.path.basename(pathx))
    return movedfiles



def copyfiles(copyfilesx, intermediatepath):
    """
    Copy files from MineSight project directory
    to program directory.

    Return a list of base file names for the
    copied files.

    copyfilesx is a generator of file paths.

    intermediatepath is a string relative path
    between the program folder and the sub-folder
    of the MineSight directory (_msresources/06SOLIDS,
    for example).
    """
    # Copy files to that folder.
    copiedfiles = []
    for pathx in copyfilesx:
        shutil.copyfile(pathx, fld.PROGFOLDER + intermediatepath +
                        os.path.basename(pathx))
        copiedfiles.append(intermediatepath + os.path.basename(pathx))
    return copiedfiles



def compressfilessingle(filestocompress, prefix, basedir):
    """
    Compresses files into an archive.

    This is for larger files that take up
    an entire archive (7z file).

    filestocompress is a list of paths of
    files to be compressed.  These files
    reside inside the program directory.

    prefix is a string path addition, usually
    './' that allows the function to deal
    with relative paths for files that reside
    in subfolders.

    basedir is the name of the main MineSight
    project directory (Fwaulu, for example).

    Side effect function.
    """
    for pathx in filestocompress:
        basename = os.path.split(pathx)[1]
        # Need unique name for subfolder files with same names.
        uniqueid = fld.UNIQUEX.format(COUNTERX.next())
        uniquename = uniqueid + basename
        print(fld.COMPRESSING.format(prefix + basename))
        archx = z7thing(fld.ARCHIVEX.format(basedir, uniquename),
                        [prefix + basename])
        archx.compress()



def compressfilesmultiple(filestocompress, indexx, basedir):
    """
    Compresses files into an archive.

    filestocompress is a list of paths of
    files to be compressed.  These files
    reside inside the program directory.

    indexx is an integer that gives the
    archive a unique name.

    basedir is the name of the main MineSight
    project directory (Fwaulu, for example).

    Side effect function.
    """
    print(fld.COMPRESSING.format(fld.MULTFILES))
    archx = z7thing(fld.MULTARCHIVEX.format(basedir, indexx),
                                            filestocompress)
    archx.compress()



def segregatefiles(directoryx, basefiles):
    """
    From a string directory path directoryx
    and a list of base file names, returns
    a dictionary of lists of files and their
    sizes sorted on size and keyed on file
    category.
    """
    retval = {}
    # Add separator to end of directory path.
    directoryx += fld.FRONTSLASH
    # Get all files in folder and their sizes.
    allfiles = [(os.path.getsize(fld.FILEB.format(directoryx, filex)), filex)
                 for filex in basefiles]
    retval[fld.SMALL] = [x for x in allfiles if x[0] < fld.HALFLIMIT]
    retval[fld.SMALL].sort()
    retval[fld.MEDIUM] = [x for x in allfiles if x[0] >= fld.HALFLIMIT and
                          x[0] <= fld.CHUNK]
    retval[fld.MEDIUM].sort()
    retval[fld.LARGE] = [x for x in allfiles if x[0] > fld.CHUNK]
    retval[fld.LARGE].sort()
    return retval



def deletefiles(movedfiles):
    """
    Delete files that have been compressed.

    movedfiles is a list of paths of
    files that have been moved or copied to
    the program directory for compression.

    Side effect function.
    """
    for pathx in movedfiles:
        print(fld.DELETING.format(pathx))
        os.remove(pathx)



def getsmallfilegroupings(smallfiles):
    """
    Generator function that yields
    a list of files whose sum is 
    less than the program's limit
    for bytes to be archived in a 
    multiple file archive.

    smallfiles is a list of two tuples
    of (filesize in bytes, file path).
    """
    lenx = len(smallfiles)
    insidecounter1 = 0
    insidecounter2 = 1
    while (insidecounter2 < (lenx + 1)):
        sumx = sum(x[0] for x in smallfiles[insidecounter1:insidecounter2])
        if sumx > fld.MULTFILELIMIT:
            # Back up one.
            insidecounter2 -= 1
            yield (x[1] for x in smallfiles[insidecounter1:insidecounter2])
            # Start the next group at the file that
            # put the sum over the limit.
            insidecounter1 = insidecounter2
            insidecounter2 = insidecounter1 + 1
        else:
            insidecounter2 += 1
    # Yield whatever is left after the last full group.
    if insidecounter1 < lenx:
        yield (x[1] for x in smallfiles[insidecounter1:])



def compresslargefiles(largefiles, dirx, prefix, basedir, splitfilestrackerfile):
    """
    Deal with compression of files that need to
    be split prior to compression.

    largefiles is a list of two tuples of file
    sizes and names.

    dirx is the directory (str) in which the files
    are located.

    prefix is a string prefix to augment path
    identification for compression.

    basedir is the name of the main MineSight
    project directory (Fwaulu, for example).

    splitfilestrackerfile is an open file
    object used for tracking file splits
    for later retrieval.

    Side effect function.
    """
    for filex in largefiles:
        # Get generator of paths of splits.
        splitfiles = splitfile(fld.FILEB.format(dirx, filex[1]),
                               splitfilestrackerfile)
        movedfiles = movefiles(splitfiles, prefix)
        compressfilessingle(movedfiles, prefix, basedir)
        deletefiles(movedfiles)

def compressmediumfiles(mediumfiles, dirx, prefix, basedir):
    """
    Deal with compression of files that need to
    be compressed each to its own archive.

    mediumfiles is a list of two tuples of file
    sizes and paths.

    dirx is the directory (str) in which the files
    are located.

    prefix is a string prefix to augment path
    identification for compression.

    basedir is the name of the main MineSight
    project directory (Fwaulu, for example).

    Side effect function.
    """
    filestocopy = (dirx + x[1] for x in mediumfiles)
    copiedfiles = copyfiles(filestocopy, prefix)
    compressfilessingle(copiedfiles, prefix, basedir)
    deletefiles(copiedfiles)

def compresssmallfiles(smallfiles, dirx, prefix, indexx, basedir):
    """
    Deal with compression of files that can be
    compressed in groups.

    smallfiles is a list of two tuples of file
    sizes and paths.

    dirx is the directory (str) in which the files
    are located.

    prefix is a string prefix to augment path
    identification for compression.

    indexx is the current index that the 7zip
    file counter (ensures unique archive name)
    is on.

    basedir is the name of the main MineSight
    project directory (Fwaulu, for example).

    Returns integer for current archive counter
    index.
    """
    smallgroupings = getsmallfilegroupings(smallfiles)
    while True:
        try:
            grouplittlefiles = smallgroupings.next()
            littlefiles = (dirx + x for x in grouplittlefiles)
            copiedfiles = copyfiles(littlefiles, prefix)
            compressfilesmultiple(copiedfiles, indexx, basedir)
            indexx += 1
            deletefiles(copiedfiles)
        except StopIteration:
            break
    return indexx



# XXX - hack
def matchbasedir(folderlist):
    """
    Get MineSight project folder name
    that matches a folder in the path
    in question.

    folderlist is a list (in order)
    of directories in a path.

    Returns string.
    """
    for folderx in folderlist:
        for projx in fld.BASEDIRS:
            if projx == folderx:
                return folderx
    return None



def getbasedir(pathx):
    """
    Returns two tuple of strings for
    basedir and basefolder (project
    directory name and base path under
    project directory copied to program
    directory).

    pathx is the directory path being
    processed (str).
    """
    # basedir is project name (Fwaulu, for example).
    foldernames = pathx.split(fld.FRONTSLASH)
    basedir = matchbasedir(foldernames)
    # Get directory under project directory.
    # _msresources, for example.
    idx = foldernames.index(basedir)
    # Directory under program directory ./ for MineSight files.
    basefolder = fld.SAMEFOLDER + fld.FRONTSLASH.join(foldernames[idx + 1:])
    return basedir, basefolder



def dealwithtoplevel(firstdir):
    """
    Compress top level files in the 
    MineSight project directory.
    
    firstdir is the three tuple returned
    from the os.walk() generator function.

    Returns two tuple of integer smallfile
    multifilecounter used for naming
    multiple file archives and splitfilestrackerfile,
    an open file object for tracking split
    files for later reconstruction.
    """
    # Top level files.
    dirx = firstdir[0] + fld.FRONTSLASH
    basedir, basefolder = getbasedir(dirx)
    # File to track split files for later glueing back together.
    splitfilestrackerfile = open(fld.SAMEFOLDER + basedir + fld.FRONTSLASH +
                                 fld.SPLITFILETRACKER, fld.W)
    firstdirfiles = segregatefiles(firstdir[0], firstdir[2])
    compresslargefiles(firstdirfiles[fld.LARGE], dirx, fld.EMPTY, basedir,
                       splitfilestrackerfile)
    compressmediumfiles(firstdirfiles[fld.MEDIUM], dirx, fld.EMPTY, basedir)
    # This is for keeping track of
    # archives with more than one file.
    multifilecounter = 1
    multifilecounter = compresssmallfiles(firstdirfiles[fld.SMALL], dirx,
                                          fld.EMPTY, multifilecounter, basedir)
    return multifilecounter, splitfilestrackerfile



def dealwithlowerleveldirectories(dirs, multifilecounter, splitfilestrackerfile):
    """
    Finishes out compression of lower level
    folders under top level MineSight project
    directory.

    dirs is a partially exhausted (one iteration)
    os.walk() generator.

    multifilecounter is an integer used for
    naming multiple file archives.

    splitfilestrackerfile is an open file
    object used for tracking file splits
    for later retrieval.

    Returns orphanedfolders, a list of lower level
    folders to be deleted at the end of the program
    run.
    """
    orphanedfolders = []
    for dirx in dirs:
        # XXX - hack - I hate dealing with Windows paths.
        dirn = dirx[0].replace(fld.BACKSLASH, fld.FRONTSLASH)
        diry = dirn + fld.FRONTSLASH
        basedir, basefolder = getbasedir(diry)
        # Create directory in program path.
        fauxdir = fld.PROGFOLDER[:-1] + basefolder[1:-1]
        os.mkdir(fauxdir)
        orphanedfolders.append(fauxdir)
        # Skip anything that doesn't have files.
        if not dirx[2]:
            continue
        # Easiest way to do this might be
        # to track directories and sort
        # files according to size, then
        # filter them accordingly.
        dirfiles = segregatefiles(dirx[0], dirx[2])
        compresslargefiles(dirfiles[fld.LARGE], diry, basefolder,
                           basedir, splitfilestrackerfile)
        compressmediumfiles(dirfiles[fld.MEDIUM], diry, basefolder, basedir)
        multifilecounter = compresssmallfiles(dirfiles[fld.SMALL], diry, basefolder,
                                              multifilecounter, basedir)
    splitfilestrackerfile.close()
    return orphanedfolders



def walkdir(dirx):
    """
    Traverse MineSight project directory,
    7zipping everything along the way.

    dirx is a string for the directory
    to traverse.

    Side effect function.
    """
    dirs = os.walk(dirx)
    # OK - os.walk returns generator that
    #      yields a tuple in the format
    #          (str path,
    #           [list of paths for directories under path],
    #           [list of filenames under path])

    # Top level (Fwaulu, for instance).
    # These files will not have a path
    # prefix of any sort in their respective
    # archives.
    firstdir = dirs.next()
    multifilecounter, splitfilestrackerfile = dealwithtoplevel(firstdir)
    # All other files and folders.
    orphanedfolders = dealwithlowerleveldirectories(dirs, multifilecounter,
                                                    splitfilestrackerfile)
    # Delete lower level folders first - this is necessary.
    orphanedfolders.reverse()
    for orphanx in orphanedfolders:
        print(fld.DELETINGDIR.format(orphanx))
        os.rmdir(orphanx)



def cyclefolders(folderx):
    """
    Wrapper function for compression
    of folder folderx (string).

    Side effect function.
    """
    # 1) Set up empty project directory (ex: Fwaulu)
    #    in program directory.
    # 2) For first set of files, use no prefix for
    #    7zip archive storage (filename only).
    # 3) Check for size of file.
    # 4) If file is bigger than fld.CHUNK, split.
    # 5) If file is no bigger than fld.CHUNK, but at least
    #    fld.HALFLIMIT, compress to one archive.
    # 6) If file is smaller than fld.HALFLIMIT, check subsequent
    #    files to determine files to include in archive. Keep
    #    track of file index that puts number of bytes over
    #    fld.MULTFILELIMIT.
    # 7) Compress multiple files to one archive - index
    #    archive to ensure unique name.
    # 8) For all following sets of files, same process,
    #    but must prefix paths with SAMEFOLDER and any
    #    additional folder names.
    foldertracker = []
    # Make directory folder in program directory
    # to hold 7zip files.
    zipfolder = getbasedir(folderx)[0]
    os.mkdir(zipfolder)
    foldertracker.append(zipfolder)
    walkdir(folderx)
    print('\nDone')



cyclefolders is the overarching wrapper function for the module (compression operation).
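The size rules in the cyclefolders comments can be sketched on their own, without the 7-zip library. This is a simplified stand-in, not the module's code: categorize and groupsmall are hypothetical names, the constants are copied from folderstozip.py, and the grouping works on bare sizes rather than (size, path) tuples.

```python
# Values copied from folderstozip.py; // keeps integer
# division on both Python 2 and Python 3.
MULTFILELIMIT = 8500
HALFLIMIT = MULTFILELIMIT // 2
CHUNK = 2 ** 26


def categorize(sizex):
    """Return 'small', 'medium' or 'large' for a size in bytes."""
    if sizex < HALFLIMIT:
        return 'small'
    if sizex <= CHUNK:
        return 'medium'
    return 'large'


def groupsmall(sizes, limit=MULTFILELIMIT):
    """
    Pack sizes (in order) into lists whose
    totals stay at or under limit.
    """
    groups = []
    current = []
    for sizex in sizes:
        # Start a new group when this file would go over the limit.
        if current and sum(current) + sizex > limit:
            groups.append(current)
            current = []
        current.append(sizex)
    # Keep the last, partially filled group.
    if current:
        groups.append(current)
    return groups


categories = [categorize(x) for x in (100, 4250, CHUNK, CHUNK + 1)]
groups = groupsmall([3000, 3000, 3000, 3000])
print(categories)
print(groups)
```

Because every small file is under half the multiple file limit, any two of them fit in one archive, so a group never ends up empty.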



#!java -jar C:\jython-2.7.0\jython.jar



# unsevenzipper.py



"""
Use java 3rd party 7-zip compression
library (sevenZipJBindings) from
jython to un-7zip archives.
"""



# Need to adjust path to get necessary jar imports.
# XXX - it might be cleaner to chain imports by using
#       the sevenzipper (s7 alias) below to reference
#       double imported modules.  For development and
#       convenience I reimported everything as though
#       sevenzipper.py and unsevenzipper.py were separate
#       operations.
import sys
import folderstozip as fld
sys.path.append(fld.PATH7ZJB)
sys.path.append(fld.PATH7ZJBOSSPEC)

import os

import sevenzipper as s7

import SevenZipThingExtract



def subdirectoryornot(pathx):
    """
    Boolean function that returns
    True if string pathx is a
    subdirectory of the MineSight
    project folder and False if
    the files belong directly to
    the MineSight project folder.
    """
    pathx = pathx.replace(fld.SAMEFOLDERWIN, fld.BACKSLASH)
    pathlist = pathx.split(fld.BACKSLASH)
    if len(pathlist) > 1:
        return True
    return False



def getdirectories(dirx):
    """
    Get list of lists of directories
    in path under project folder
    from 7zip archives in project
    folder for archives.

    Returns two tuple of list and
    dictionary indicating which
    7z files are same directory
    archives and which are archived
    subdirectory files.

    dirx is a string for the file
    path of the directory to
    be walked (./Fwaulu for example).
    """
    dirs = os.walk(dirx)
    # One level, no subfolders.
    files = dirs.next()[2]
    # Get directories first.
    rawpaths = []
    subdirornot = {}
    for filex in files:
        # Skip uncompressed split file tracker.
        if filex == fld.SPLITFILETRACKER:
            continue
        # I don't know if it's a subdirectory or not, so I'll go with False.
        s7tx = SevenZipThingExtract(dirx + fld.FRONTSLASH + filex, dirx, False)
        folders = list(s7tx.getfolders())
        rawpaths.extend(folders)
        # All the paths in folders have the same prefix - 
        # just do one.
        subdirornot[filex] = subdirectoryornot(folders[0])
    # Get just directories
    justdirectories = [pathx.replace(fld.SAMEFOLDERWIN, fld.BACKSLASH).split(fld.BACKSLASH)[1:-1]
                       for pathx in rawpaths if pathx.split(fld.BACKSLASH)[1:-1]]
    justdirectories = set([tuple(x) for x in justdirectories])
    justdirectories = list(justdirectories)
    justdirectories.sort()
    return justdirectories, subdirornot



def makedirectories(dirn):
    """
    Create directory paths within archive
    project folder to accept uncompressed
    files.

    Returns subdirornot dictionary.

    dirn is a string for the file
    path of the directory to
    be walked (./Fwaulu for example).
    """
    justdirectories, subdirornot = getdirectories(dirn)
    maxdepth = max(len(dirx) for dirx in justdirectories)
    for x in xrange(0, maxdepth):
        justdirectoriesii = set([tuple(dirx[0:x + 1]) for dirx in justdirectories
                                 if len(dirx) >= x + 1])
        for diry in justdirectoriesii:
            dirw = dirn + fld.FRONTSLASH + fld.FRONTSLASH.join(diry)
            os.mkdir(dirw)
    return subdirornot

def extractfiles(dirx):
    """
    Extract files from 7z files
    in project archive folder.

    Side effect function.

    dirx is a string for the file
    path of the directory to
    be walked.
    """
    subdirornot = makedirectories(dirx)
    dirs = os.walk(dirx)
    # One level, no subfolders.
    files = dirs.next()[2]
    for filex in files:
        # Skip uncompressed split file tracker.
        if filex == fld.SPLITFILETRACKER:
            continue
        s7tx = SevenZipThingExtract(dirx + fld.FRONTSLASH + filex,
                                    dirx, subdirornot[filex])
        s7tx.extractfiles()



def gluetogethersplitfiles(dirx):
    """
    Make split up files whole.

    Side effect function.

    dirx is the folder in which the split
    files reside.
    """
    # Glue together big files.
    # Do this in a very controlling,
    # structured way:
    # 1) Read the split file tracker csv file.
    # 2) Determine the number and names and paths
    #    of files to be reconstructed and the
    #    number of parts in each.
    # 3) Check that everything is there for
    #    each file to be reconstructed.
    # 4) Get the new relative path.
    # 5) Glue back together programmatically.
    splitfiles = []
    # fld.SPLITFILETRACKER is structured as original path
    # of file split, number of file split.
    with open(fld.SAMEFOLDERWIN + dirx +
              fld.FRONTSLASH + fld.SPLITFILETRACKER, 'r') as f:
        for linex in f:
            strippedline = [x.strip() for x in linex.split(fld.UCOMMA)]
            splitfiles.append(tuple(strippedline))
    orignames = [x[0] for x in splitfiles]
    splitoriginals = set(orignames)
    # Make dictionary that is easy to cycle through.
    filesx = {}
    for orig in splitoriginals:
        basedir, basefolder = s7.getbasedir(orig)
        filesx[orig] = {}
        filesx[orig][fld.BASEPATH] = fld.SAMEFOLDER + basedir + basefolder[1:]
        filesx[orig][fld.FILES] = (fld.SPLITFILE.format(filesx[orig][fld.BASEPATH], filex[1])
                for filex in splitfiles if filex[0] == orig)
    for orig in filesx:
        with open(filesx[orig][fld.BASEPATH], fld.WB) as mainfile:
            for filex in filesx[orig][fld.FILES]:
                with open(filex, fld.RB) as splitfile:
                    mainfile.write(splitfile.read())



def restore(dirx):
    """
    Restores MineSight project directory
    inside program path.

    dirx is a string for the directory
    to be restored (./Fwaulu, for example).

    Side effect function.
    """
    extractfiles(dirx)
    gluetogethersplitfiles(dirx)
    print('Done')



restore is the main function for the module (uncompression).
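The split and glue steps are the part most worth checking before trusting a backup to them. Below is a minimal round trip sketch in plain Python that uses a throwaway temp folder and a tiny chunk size. The helper names (splitdemo, gluedemo), the chunk size and the file names are all made up for the demonstration; the real modules use fld.CHUNK and the tracker csv file instead.

```python
import os
import shutil
import tempfile

# Tiny chunk size so the demonstration runs instantly;
# the real code uses fld.CHUNK (2 ** 26 bytes).
DEMOCHUNK = 4


def splitdemo(pathx, chunksize):
    """Split pathx into numbered parts; return the part paths."""
    parts = []
    with open(pathx, 'rb') as f:
        i = 1
        while True:
            data = f.read(chunksize)
            if not data:
                break
            partpath = '{0:s}.{1:03d}'.format(pathx, i)
            with open(partpath, 'wb') as f2:
                f2.write(data)
            parts.append(partpath)
            i += 1
    return parts


def gluedemo(parts, targetpath):
    """Concatenate parts, in order, into targetpath."""
    with open(targetpath, 'wb') as mainfile:
        for partpath in parts:
            with open(partpath, 'rb') as partfile:
                mainfile.write(partfile.read())


tmpdir = tempfile.mkdtemp()
try:
    original = os.path.join(tmpdir, 'bigfile.dat')
    payload = b'0123456789'
    with open(original, 'wb') as f:
        f.write(payload)
    parts = splitdemo(original, DEMOCHUNK)
    restored = os.path.join(tmpdir, 'restored.dat')
    gluedemo(parts, restored)
    with open(restored, 'rb') as f:
        roundtrip = f.read()
    numparts = len(parts)
    print(roundtrip == payload)
finally:
    # Clean up the throwaway folder.
    shutil.rmtree(tmpdir)
```

Reading until the file is exhausted, rather than computing the number of chunks up front, avoids writing an empty last part when the file size is an exact multiple of the chunk size.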



Notes:



1) I don't have admin rights at work and did not have javac (the compiler for java) available.  You can download a JDK (Java Development Kit) package from Oracle that has it.  Without admin rights, you can't install it normally.  Still, you can use it.  My compilation went something like this:

<path to downloaded JDK>/bin/javac -cp <path to downloaded 7-ZipJBinding>/lib/* <myclassname>.java



2) I've left all the split up files and 7z archives in the folder where I decompress my files and recombine the split files.  This takes up a lot of space depending on what you're working with.  If space is at a premium, you probably want to write jython code to move or delete the archives after uncompressing them.
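
A minimal sketch of that cleanup in plain Python, assuming the archives all carry a .7z extension and sit directly in the restore folder.  The function name removearchives and the default extension are assumptions for illustration, not part of the module above:

```python
import glob
import os


def removearchives(dirx, extension='.7z'):
    """
    Delete every file ending in extension
    directly inside directory dirx.

    Returns the sorted list of paths removed.
    Side effect function.
    """
    removed = []
    for pathx in sorted(glob.glob(os.path.join(dirx, '*' + extension))):
        # Skip anything that is not a regular file (folders, for example).
        if os.path.isfile(pathx):
            os.remove(pathx)
            removed.append(pathx)
    return removed
```

Calling it after the restore step (removearchives('./Fwaulu'), for example) would leave everything but the archives in place.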



3) The most time consuming part of runtime is the compression, uncompression, and splitting and recombining of split files.  Porting some of this to java (instead of jython) might speed things up.  I code faster and generally better in jython.  Also, my objective was control, not speed.  YMMV (your mileage may vary) with this approach.  There are far better general purpose ones.

Thanks for stopping by.


Improved Storing and Displaying Images in Postgresql - bytea
2015-12-11T20:57:00.000-08:00
Last post I brute forced the storage of binary image (jpeg) data as text in a Postgresql database, and accordingly brute forced the data's display in the Unix image viewer feh from output from a psql query.  It was hackish and I received some negative, but good constructive criticism on how to improve it:



1) use Python's base64 module instead of the binascii one.



2) use bytea as a storage type in Postgresql instead of text.



Marius Gedminas made the base64.b64encode suggestion for text.  It does make for a little less storage space.  Ultimately we won't go with this solution because we want to go with bytea, the Postgresql data type intended for this type of data.  But for completeness, here is what a base64.b64encode text solution would look like:



$ python3.5 
Python 3.5.0 (default, Oct 23 2015, 21:23:18) 
[GCC 4.2.1 20070719 ] on openbsd5
Type "help", "copyright", "credits" or "license" for more information.
>>> import base64
>>> f = open('prrrailwhaletankcar.jpg', 'rb')
>>> bindata = f.read()
>>> f.close()
>>> b64data = str(base64.b64encode(bindata))
>>> # Converted data to string for write with csv file
>>> # to database table text field.
>>> # The string representation of BASE 64 includes the
>>> # letter b and single quotes.
>>> b64data[:10]
"b'/9j/4AAQ"
>>> b64data[-10:]
"AVAH/2Q=='"
>>> b64data[1]
"'"
>>> b64data[-1]
"'"
>>> # Isolate the BASE 64 digits with the quotes included.
>>> substrx = b64data[1:]
>>> picdata = base64.b64decode(substrx)
>>> f = open('test.jpg', 'wb')
>>> f.write(picdata)
187810
>>> f.close()
>>> len(substrx)
250418
>>> # BASE 64 string is 1 1/3 times as big as the 
>>> # binary data it represents.
>>> _/187810
1.3333581811405144
>>> # Taking off the quote marks doesn't inhibit the
>>> # decoding of the BASE64 string at all - probably
>>> # best to go with this less is more approach.
>>> subsubstrx = substrx[1:-1]
>>> picdata = base64.b64decode(subsubstrx)
>>> f = open('test2.jpg', 'wb')
>>> f.write(picdata)
187810
>>> f.close()
>>> len(picdata)
187810
>>> # BASE64 string ever so slightly smaller without
>>> # the quote marks (2 chars).
>>> len(subsubstrx)
250416
>>> _/187810
1.3333475320802939

>>> # Works in both cases.
>>> import os
>>> os.system('feh --geometry 400x300+200+200 test.jpg')

0
>>> os.system('feh --geometry 400x300+200+200 test2.jpg')

0
>>>



The results for both commands in the last lines (show picture with feh) look the same:




 




Storing the BASE 64 string in a Postgresql text column is the same as storing the hex one like I did in the last post.  The main thing to look out for is the proper stripping of the Python generated string for extra characters - single quotes are OK as long as they are matched on either end of the string.  As I mentioned in the code comments above, knowing what I know now, I would strip them out too even prior to storing the string in a database.
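
A small sketch of the "strip them out before storing" approach:  decoding the bytes object that b64encode returns gives a clean string with no b prefix or quote marks to worry about.  The helper names tobase64text and frombase64text are made up for illustration.

```python
import base64


def tobase64text(bindata):
    """
    Return the BASE 64 digits for bindata as a plain str
    (no b prefix, no quote marks).
    """
    # b64encode returns bytes; decoding as ASCII avoids the
    # "b'...'" wrapper that str() would add.
    return base64.b64encode(bindata).decode('ascii')


def frombase64text(textdata):
    """
    Return the original binary data for a BASE 64 str.
    """
    return base64.b64decode(textdata)


sample = bytes(range(256)) * 3
encoded = tobase64text(sample)
assert frombase64text(encoded) == sample
# 768 bytes in, 1024 characters out (a 4 to 3 ratio).
print(len(sample), len(encoded))
```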




On to the Postgresql bytea storage part of the post.  Someone I respect asked me on Facebook, "Why didn't you just use bytea (for storage)?"  I had to sheepishly own up to just not being used to working with binary data (as opposed to strings) so I went with what I knew.  Shame drove me to at least attempt to do things the right way - binary storage for binary data, in this case a jpeg image.


Postgresql 9.4 uses a hex based representation (hex format) for the bytea data type by default.  It is possible to mess this up - it is covered in the doc but I didn't read it carefully enough:



If you preface your hexadecimal string with \x (single backslash) you will end up with an octal representation of your binary data (digits 0 through 7).  \\x prior to the hexadecimal string will give you what you, or at least I, want:  a hexadecimal representation of your binary data on output.  Here is the SQL expression I used for processing my string data (already in the database from my work on the last blog post):









/* Postgresql SQL code */

CAST('\\x' || <hexadecimal string> AS bytea)



The || operator is for concatenation of strings (this is probably obvious to Postgresql and other database distro users but MSSQL uses a + symbol so it was a little new to me).
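
For comparison, here is a rough Python sketch of the same prefix-and-concatenate idea.  The function names are hypothetical.  It relies on bytes.hex and bytes.fromhex (Python 3.5 and later) and assumes the hex format text is a backslash and an x followed by two hex digits per byte.  The backslash is doubled in the Python source only because that is how a backslash is escaped in a Python string literal.

```python
def tobyteahex(bindata):
    """
    Return the bytea hex format text for bindata:
    a backslash, an x, then two hex digits per byte.
    """
    return '\\x' + bindata.hex()


def frombyteahex(textdata):
    """
    Return the binary data for bytea hex format text.
    """
    if not textdata.startswith('\\x'):
        raise ValueError('expected text starting with \\x')
    return bytes.fromhex(textdata[2:])


# First four bytes of a typical jpeg file.
jpegstart = b'\xff\xd8\xff\xe0'
astext = tobyteahex(jpegstart)
print(astext)
assert frombyteahex(astext) == jpegstart
```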



To deal with transitioning all my text picture columns to bytea I did the following:



1) create a new set of identical tables to the ones I had in the same database with new relations identical to the old ones but with the new set of tables.



2) fill the new tables in with the new data that has all the former text columns for binary as bytea.



3) delete the old tables once the new ones are filled in.



4) rename the new tables to match the names of the old ones (how I wanted the database schema to look in the first place).
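
The four steps can be sketched end to end in Python.  This is only an illustration of the pattern:  it uses the standard library sqlite3 module as a stand-in for Postgresql (so BLOB instead of bytea, and the hex conversion happens in Python rather than with a CAST), and the table cars and its single row are made up.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Old table: picture stored as hex text (made up sample row).
cur.execute('CREATE TABLE cars (namex TEXT PRIMARY KEY, picture TEXT)')
cur.execute('INSERT INTO cars VALUES (?, ?)', ('G-39A', 'ffd8ffe0'))

# 1) New table with the same layout but a binary picture column.
cur.execute('CREATE TABLE cars2 (namex TEXT PRIMARY KEY, picture BLOB)')

# 2) Fill the new table, converting hex text to binary on the way.
rows = cur.execute('SELECT namex, picture FROM cars').fetchall()
cur.executemany('INSERT INTO cars2 VALUES (?, ?)',
                [(namex, bytes.fromhex(pic)) for namex, pic in rows])

# 3) Drop the old table once the new one is filled in.
cur.execute('DROP TABLE cars')

# 4) Rename the new table to the old name.
cur.execute('ALTER TABLE cars2 RENAME TO cars')
conn.commit()

picture = cur.execute('SELECT picture FROM cars').fetchone()[0]
print(picture)
```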







Postgresql is different than MSSQL in that the database is more its own autonomous entity that needs to be connected to other databases by some introduced mechanism.  In MSSQL, databases on the same server can reference each other in queries by default.  I started looking into the Postgresql fdw (foreign data wrapper) plugin, then realized I could do this more easily with the path I took above.



It's not necessary to post all the SQL code.  I used a psql variable in my SQL for the hexadecimal data prefix to make sure I got it right each time.  From inside psql I executed the SQL files with the \i metacommand.  Here is a snippet with the variable.


/* Postgresql SQL code to be used with
   the Postgresql psql interpreter */



/* Need this for bytea conversion
   from hex string */
\set byteaidstr '\\x'
 

INSERT INTO locomotiveprototypes2
    SELECT keyx,
           namex,
           railnamex,
           paintscheme,
           photourl,
           comments,
           CAST(:'byteaidstr' || picture AS bytea)
    FROM locomotiveprototypes;



The variable thing in psql takes a little getting used to but the Postgresql documentation is good about explaining when and how to use the single quote marks and where to put them.  It worked out.



The most important part:  getting the picture to show up from a psql metacommand through the use of a python script.  Here is my modified script similar to the one in my last post:




#!/usr/local/bin/python3.5

"""
Processing of image coming out
of Postgresql query as a stream.

Deals with bytea column string
output from psql.
"""

import base64
import sys
import subprocess

DECODED = 'decoded'

SIZEMSG = '\nsize of {0:s} output = {1:d}\n'
SIZERATIOMSG = '\nsize of {0:s} output/size of binary output = {1:05.5f}\n'

# Want to avoid '\\x' in query output.
STARTINDEX = 3

FEHCMD = ['feh', '--geometry', '400x300+200+200', '-']

BYTEAFMT = 'bytea hex format'

# 2 variables track changes in size of 
#     hex output from query in psql.
sizex = 0
lenxbin = 0

# Feeding to script straight from
# psql \copy metacommand.
inputx = sys.stdin.buffer.read()
sizex = len(inputx)

# print's are mainly for flagging when something goes wrong. 
#     aka debugging
print(inputx[:10])
print(inputx[STARTINDEX])
print(inputx[-10:])

# -1 index in slice chops off the return character '\n'
# Need casefold=True to deal with lower case from Postgresql.
binx = base64.b16decode(inputx[STARTINDEX:-1], casefold=True)
lenxbin = len(binx)

# print's highlight size relationship between
#     hex representation and actual binary data.
print(SIZEMSG.format(BYTEAFMT , sizex))
print(SIZEMSG.format(DECODED, lenxbin))
print(SIZERATIOMSG.format(BYTEAFMT, sizex/lenxbin))

# Pops up picture on screen.
subprocess.run(FEHCMD, input=binx)

print('\nDone\n')



An important change I made from last time is fixing the call to the image viewer feh to eliminate all that hacky intermediate writing of a jpeg file that took forever (in computer time).  It turns out feh accepts binary input from a pipe or stdin just fine - I just needed to read the man page more thoroughly.



Now to see if this works:



$ psql hotrains carl
Password for user carl: 
psql (9.4.4)
Type "help" for help.

hotrains=# \copy (SELECT picture FROM locomotiveprototypes WHERE keyx = 3) to program 'imageshow.py'
COPY 1
b'\\\\xffd8ffe'
102
b'8a000ffd9\n'

size of bytea hex format output = 1081720


size of decoded output = 540858


size of bytea hex format output/size of binary output = 2.00001







And we're good to go.



Thanks for stopping by.


Storing and Displaying Images in Postgresql
2015-10-17T23:36:00.002-07:00
Last post I set up a toy (literally) Postgresql database for my model train car collection.  A big part of the utility of the database is its ability to store images (pictures or photos) of the real life prototype and model train cars.  Postgresql (based on my google research) offers a couple methods of doing that.  I'll present how I accomplished this here.  The method I chose suited my home needs.  For a commercial or large scale project, something more efficient in the way of storage and speed of retrieval may be better.  Anyway, here goes.



I chose to store my photos as text representations of binary data in Postgresql database table columns with the text data type.  This decision was mainly based on my level of expertise and the fact that I am doing this for home use as part of a learning experience.  Storing the binary data as text inflates their size by a factor of two - very inefficient for storage.  For home use in a small database like mine, storage is hardly an issue.  At work I transfer a lot of binary data (3 dimensional mesh mined solids) to a distant server in text format using MSSQL's bcp.  Postgresql is a little different, but I am familiar with the general idea of stuffing a lot of text in a database column.



In order to get the data into comma delimited rows without dealing with a long, unwieldy string of text from the photos, I wrote a Python script to do it:



#!python3.4

"""
Prepare multiple rows of data
that includes a hexlify'd
picture for a column in
a table in the model train
database.
"""

import binascii
import os

UTF8 = 'utf-8'
# LATIN1 = 'latin-1'

INFOFILE = 'infoiii.csv'

PICTUREFILEFMT = '{:s}.{:s}'
ROWFILEOUTFMT = '{:s}row'

JPG = 'jpg'
PNG = 'png'

COMMA = ','

PATHX = '/home/carl/postgresinstall/workpictures/multiplecars/'

PATHXOUT = PATHX + 'rows/'

PHOTOMSG = 'Now doing photo {:s} . . .'

def checkfileextension(basename):
    """
    With the basename of an image file
    returns True for jpg and false for
    anything else (png).
    """
    return os.path.exists(PATHX + PICTUREFILEFMT.format(basename, JPG))

with open(PATHX + INFOFILE, 'r', encoding=UTF8) as finfo:
    for linex in finfo:
        newlineparts = [x.strip() for x in linex.split(COMMA)]
        photox = newlineparts.pop()
        print(PHOTOMSG.format(photox))
        # Check for jpg or png here
        # XXX - this could be better - could actually
        #       check and return actual extension;
        #       more code and lazy.
        extension = ''
        if checkfileextension(photox):
            extension = JPG
        else:
            extension = PNG
        with open(PATHX + PICTUREFILEFMT.format(photox, extension),
                  'rb') as fphoto:
            contents = binascii.hexlify(fphoto.read())
            liney = COMMA.join(newlineparts)
            liney += COMMA
            liney = bytes(liney, UTF8)
            liney += contents
            with open(PATHXOUT + ROWFILEOUTFMT.format(photox),
                      'wb') as frow:
                frow.write(liney)

print('\nDone\n')



The basic gist of the script is to get each photo name provided into a file that can be later imported into a table in Postgresql.  The paths in the capitalized "constants" would have to be adjusted for your situation (I tend to go overboard on capitalized constants because I'm a lousy typist and want to avoid screwing up and then having to debug my typos).  The INFOFILE referred to in the script has roughly the following format:



<column1data>, <column2data>, . . . , <photofilename>



So the idea is to take a comma delimited file, encode it in UTF-8, and stuff the binary data from the (correct) photo at the end as text.  I designed my database tables with photos (I use the column name "picture") with the text data column as the last - this is kind of a hack, but it made scripting this easier.
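
Boiled down to a couple of functions, the row format looks something like the sketch below.  buildrow and splitrow are hypothetical names, the sample column values are made up, and the round trip only works as long as none of the text columns contains a comma.

```python
import binascii

COMMA = ','
UTF8 = 'utf-8'


def buildrow(columns, photobytes):
    """
    Return one row as bytes: the text columns joined
    with commas, then the photo as hex digits.
    """
    liney = bytes(COMMA.join(columns) + COMMA, UTF8)
    return liney + binascii.hexlify(photobytes)


def splitrow(rowbytes):
    """
    Return (columns, photobytes) for a row from buildrow.
    Assumes the hex data is the last comma delimited field.
    """
    parts = rowbytes.split(bytes(COMMA, UTF8))
    columns = [x.decode(UTF8) for x in parts[:-1]]
    return columns, binascii.unhexlify(parts[-1])


row = buildrow(['G-39A Ore Jenny', 'no comment'], b'\xff\xd8\xff\xe0')
print(row)
assert splitrow(row) == (['G-39A Ore Jenny', 'no comment'], b'\xff\xd8\xff\xe0')
```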



An example of importing one of these "row" files into the database table from within psql:



$ psql hotrains carl
Password for user carl:
psql (9.4.1)
Type "help" for help.

hotrains=# \d
                List of relations
 Schema |          Name          | Type  | Owner 
--------+------------------------+-------+-------
 public | rollingstockprototypes | table | carl
(1 row)

hotrains=# \d rollingstockprototypes
     Table "public.rollingstockprototypes"
  Column  |          Type          | Modifiers 
----------+------------------------+-----------
 namex    | character varying(50)  | not null
 photourl | character varying(150) | not null
 comments | text                   | not null
 picture  | text                   | not null
Indexes:
    "rsprotoname" PRIMARY KEY, btree (namex)

hotrains=# COPY rollingstockprototypes FROM '/home/carl/postgresinstall/G39Arow' (DELIMITER ',');

COPY 1



My Python script for actually displaying a photo or image is a little hacky in that it requires checks for the size of the output versus the size of the information pulled from the Postgresql database table.  My original script would show the picture piped to the lightweight UNIX image viewer feh as partially complete.  In order to get around this I put a timed loop in the script to check that the image data were about half of the size of the text data pulled.  It works well enough, if slowly at times:



#!/usr/local/bin/python3.4

"""
Try to mimic processing of image
coming out of postgresql query
as a stream.
"""

import binascii
import os
import time
import sys

import argparse

# Name of file containing psql \copy hex output (text).
HEXFILE = '/home/carl/postgresinstall/workpictures/hexoutput'

# 2.5 seconds max delay before abort.
# Enough time to write most big pixel
# jpg's, it appears.
MAXTIME = 2.5
PAUSEX = 0.25

# Argument name.
PICTURENAME = 'picturename'

parser = argparse.ArgumentParser()
parser.add_argument(PICTURENAME)
args = parser.parse_args()
print(args.picturename)

# Name of picture file
# written from hex query.
PICNAME = args.picturename

# Extensions feh recognizes.
PNG = 'png'
JPG = 'jpg'

FILEEXTENSIONMSG = '\nFile extension {:s} detected.\n'
UNRECOGNFILENAME = '\nUnrecognized file extension for picture '

UNRECOGNFILENAME += '{:s}\n'
ABORTMSG = '\nSorry, no data available for feh.  Aborting.\n'

SLEEPMSG = '\nSleeping {:2.2f} seconds . . .\n'

SIZEHEXFILEMSG = '\nsize of hex output = {:d}\n'
SIZEBINARYMSG = '\nsize of binary file = {:d}\n'
SIZERATIOMSG = '\nsize of hex output/size of binary file '

SIZERATIOMSG += '{:05.5f}\n'

ACCEPTABLEHEXTOBINRATIO = 1.99
ABORTMSGTOOSMALL = '\nSorry, not enough data to show a '
ABORTMSGTOOSMALL += 'complete picture.  Aborting.\n'

extension = PICNAME[-3:]
if extension == PNG:
    print(FILEEXTENSIONMSG.format(PNG))
elif extension == JPG:
    print(FILEEXTENSIONMSG.format(JPG))
else:
    print(UNRECOGNFILENAME.format(extension))
    print(ABORTMSG)
    sys.exit()

PICFILEFMT = '/home/carl/postgresinstall/workpictures/{:s}'
FEHFMT = 'feh -g 400x300+200+200 {:s}'

# Length of binary string.
lenx = 0
# 2 variables track changes in size of 
# hex output from query in psql.
sizex = 0
sizexnew = 0
# Tracks time spent sleeping.
totaltimewait = 0.0

while totaltimewait < MAXTIME:
    # Try to make sure hex file is completely written.
    sizexnew = os.path.getsize(HEXFILE)
    if sizexnew > sizex or sizexnew == 0:
        sizex = sizexnew
        print(SLEEPMSG.format(PAUSEX))
        time.sleep(PAUSEX)
        totaltimewait += PAUSEX
    elif sizexnew == sizex:
        with open(HEXFILE, 'rb') as f2:
            with open(PICFILEFMT.format(PICNAME), 'wb') as f:
                strx = binascii.unhexlify(f2.read().strip())
                lenx = len(strx)
                print(SIZEHEXFILEMSG.format(sizexnew))
                print(SIZEBINARYMSG.format(lenx))
                print(SIZERATIOMSG.format(sizexnew/lenx))
                f.write(strx)
        break

# I don't want part of a picture.
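# lenx is still 0 if the loop timed out before any data
# were decoded; check it first to avoid dividing by zero.
if not (sizexnew > 0 and lenx > 0 and
        sizexnew/lenx > ACCEPTABLEHEXTOBINRATIO):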
    print(ABORTMSGTOOSMALL)
    sys.exit()

# Pops up picture on screen.
os.system(FEHFMT.format(PICFILEFMT.format(PICNAME)))

print('\nDone\n')



Let's see if we can get a look at this in action - example of call from within psql:



hotrains=# \copy (SELECT decode(picture, 'hex') FROM rollingstockprototypes WHERE namex = 'G-39A Ore Jenny') to program 'cat > /home/carl/postgresinstall/workpictures/hexoutput | imageshowiii.py'
COPY 1
hotrains=#



And a screenshot of a (hopefully acceptable) result:







Depending on which directory I've logged into psql under, I may have to type the full paths of the output and Python file.



There is more I could do with this, but for now I'm OK with it.  Writing to a file and then checking on its size is slow.  There is probably a way to write to memory and check what's there, but I got stuck on that and decided to go with the less efficient solution.
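
One possible shape for the in-memory version, as a sketch only:  read everything from a binary stream in one go and convert it there, with no intermediate file to watch.  The function name hextoimagebytes is made up, and the handling of a leading \x marker is an assumption about what the query output might look like.

```python
import binascii
import io


def hextoimagebytes(stream):
    """
    Read all hex text from a binary stream into memory
    and return the binary image data it represents.
    """
    hextext = stream.read().strip()
    # Drop a leading \x or \\x marker if the query output has one.
    # Hex digits never include x, so this is safe without a marker too.
    hextext = hextext.lstrip(b'\\')
    if hextext[:1] == b'x':
        hextext = hextext[1:]
    return binascii.unhexlify(hextext)


# io.BytesIO stands in for the real input stream here.
fakeinput = io.BytesIO(b'\\\\xffd8ffe0\n')
print(hextoimagebytes(fakeinput))
```

In a script on the receiving end of a pipe, the stream would be sys.stdin.buffer rather than the io.BytesIO stand-in.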



Thanks for stopping by.


Setting Up Toy Postgresql Database on OpenBSD
2015-10-17T21:53:00.000-07:00
This isn't a Python scripting post, but the next one will be on the same topic.  In this post I get a Postgresql database set up on my OpenBSD laptop and get familiar with the Postgresql environment.



I primarily use Microsoft SQL Server and vendor supplied database schemas at work.  I know Postgresql has a good reputation among open source databases, but I haven't had an opportunity to use it in a work environment (I had a brief brush with Jigsaw years back - a competitor to Modular's MSSQL-based Powerview (Dispatch) in pit mining truck tracking database - but that doesn't count.)



Anyway, as I've noted in previous posts, I run OpenBSD as my operating system on my laptop at home.  The OpenBSD project has a package for Postgresql.



The first order of business is to install the Postgresql server package.  First, I'll set up a PKG_PATH  FTP mirror location from within the ksh shell:



$ export PKG_PATH=ftp://ftp3.usa.openbsd.org/pub/OpenBSD/5.7/packages/i386/



That ftp3.usa.openbsd.org server is the one in Boulder, Colorado - that's the one I usually use.  I'm in Tucson, Arizona in the Mountain timezone, so it kind of makes sense to use that one.  My understanding is that, in general, you want to use a mirror away from the main one to spread out the bandwidth and server use for the OpenBSD (or any other open source) project.



Now to install the package - this has to be done as root.  I use sudo for this (sudo's replacement, as I understand it, in OpenBSD 5.8 will be doas(1) although you'll still be able to get sudo(1) as a package).



$ echo $PKG_PATH 

 ftp://ftp3.usa.openbsd.org/pub/OpenBSD/5.7/packages/i386/



$ sudo pkg_add postgresql-server
quirks-2.54 signed on 2015-03-09T11:04:08Z
No change in quirks-2.54
postgresql-server-9.4.1p1 (extracting)
1%
1%
2%
3% ********



<etc.>



100%
postgresql-server-9.4.1p1 (installing)
0% useradd: Warning: home directory `/var/postgresql' doesn't exist, and -m was not specified
postgresql-server-9.4.1p1 (installing)|
1%
1%
2%
3% ********



<etc.> 



100%


postgresql-server-9.4.1p1: ok
The following new rcscripts were installed: /etc/rc.d/postgresql
See rcctl(8) for details.
Look in /usr/local/share/doc/pkg-readmes for extra documentation.
$



Given an internet connection with decent speed, this all goes pretty quickly.  The first set of per cent numbers are the download of the gzipped tar package binary, the second are the unzipping and install of the Postgresql binaries in the proper location in the operating system file hierarchy.



For years I had some trouble getting my head around setting up users for Postgresql and running the daemon.  Much of my database experience is as an application user at work using Microsoft SQL Server.  We use Windows Authentication there primarily.  Working on my own UNIX-based (OpenBSD) home system is a little different.



Most of the problems I've had overcoming this user/security hump related to my lack of a good strong grasp of UNIX users and permissions (like I could do it in my sleep strong grasp).  OpenBSD is a bit unique in that it has a special name for the postgresql unprivileged user:  _postgresql.  The underscore is a convention in OpenBSD for this general class of user, usually associated with a daemon that runs on startup or gets started by root, doesn't have a login (nor a password).  Michael Lucas spends several pages with a good summary of the rationale behind this, the history and its conventions in his authoritative OpenBSD book.











So, we want to take a look at the directory designated for Postgresql's data, /var/postgresql:


$ ls -lah /var | grep post



drwxr-xr-x   2 _postgresql  _postgresql   512B May 19 17:52 postgresql



$ cd postgresql



There is no data directory there (just . and .. in the /var/postgresql directory - the 2 in the ls output).  This is typically where I would get stuck in the past.  I ended up doing it manually . . . and wrong, or at least in a way that was more difficult than necessary.  Anyway, I recorded it that way, so I'll blog it as executed.

What I had difficulty understanding before was the whole unprivileged user concept.  Basically you need to use su to log on as root, then further su to log on as _postgresql:

# THIS IS AN UNNECESSARY STEP - DON'T DO THIS



$ su
Password:
# su - _postgresql
$ mkdir /var/postgresql/data
$ ls -lah /var/postgresql
total 12
drwxr-xr-x   3 _postgresql  _postgresql   512B Jun  4 19:06 .
drwxr-xr-x  23 root         wheel         512B May 19 17:52 ..
drwxr-xr-x   2 _postgresql  _postgresql   512B Jun  4 19:06 data
$ exit
# exit
$ 



# END UNNECESSARY STEP



Now I need a database cluster.  I want to initialize it with support for UTF-8 because I have some text data with umlauts in it (non-ASCII):



$ su
Password:
# su - _postgresql
$ initdb -D /var/postgresql/data -U postgres -A md5 -E UTF8 -W


The files belonging to this database system will be owned by user "_postgresql".
This user must also own the server process.

The database cluster will be initialized with locale "C".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/postgresql/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 30
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
creating template1 database in /var/postgresql/data/base/1 ... ok
initializing pg_authid ... ok
Enter new superuser password: 
Enter it again: 
setting password ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating collations ... not supported on this platform
creating conversions ... ok
creating dictionaries ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
loading PL/pgSQL server-side language ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok
copying template1 to postgres ... ok
syncing data to disk ... ok

Success. You can now start the database server using:

    postgres -D /var/postgresql/data
or
    pg_ctl -D /var/postgresql/data -l logfile start

$ exit
# exit
$ whoami
carl
$ pwd
/home/carl


A couple things:

1) There's a line in the output about fixing permissions on the existing data directory (this will show up as highlighted on the blog, possibly not in the planetpython blog feed) - had I done this correctly (just let initdb make the directory itself), that line would look something like this (I created another cluster while writing the blog just so I would understand how to do it right):

creating directory /var/postgresql/data4 ... ok 



Right there in the initdb(1) man page:  "Creating a database cluster consists of creating the directories in which the database data will live . . ."  The man page goes on to explain how to get around permission problems, etc. in this process.  Note to self:  read the man page . . . carefully.



2) What I also learned is that you can make as many database clusters as you want, all with different data directories.  postgres is the superuser name you see in the documentation and /var/postgresql/data is the directory, but, as demonstrated above in the output, you could put your data in a folder called data4.  If you gave a different name at the -U switch in the initdb command, the superuser name would be different too.  Or you could have more than one cluster with postgres named superusers but with different passwords.



All that said, one cluster per physical box and the conventional names are plenty for me - I'm just trying to get used to the Postgresql environment and get started.



At this point I need to start up the Postgresql daemon.  In the package install above, the output mentions an rc script /etc/rc.d/postgresql.  This is run by root - below is a demo of using it manually with su (instead of using it as part of an rc startup sequence at boot):



$ su
Password:

# /etc/rc.d/postgresql start 
postgresql(ok)
# pgrep postgres
6960
10175
4748
29053
32758
26201
# /etc/rc.d/postgresql stop                                                    
postgresql(ok)
# pgrep postgres 



All I did there was start the Postgresql daemon with the installed rc script, check to see that its associated processes are running, then stop the daemon with the same script.



Me being me, I can't leave good enough alone.  I want the control of starting and stopping the daemon when I decide to (I am running this on a laptop).  As I understand it, pg_ctl is a wrapper program provided with the Postgresql install for even more low level commands and functionality.  I use pg_ctl to run the daemon and start it with the _postgresql user account:



$ su
Password: 

# su - _postgresql
$ pg_ctl -D /var/postgresql/data -l firstlog start
server starting
$ exit
# exit
$



I asked pg_ctl to make a specific log file for this session (firstlog - this will go in directory /var/postgresql/).  The logs are human readable and I wanted to study them later to see what's going on (there's all kinds of stuff in there about autovacuum and what not - sorry, we're not covering that in this blog post - but I'll have it available later).



Shutting down (stopping) the daemon is pretty simple with pg_ctl - just a few more keystrokes than if I had done it from root with the rc script:



$ su
Password:
# su - _postgresql
$ pg_ctl -D /var/postgresql/data stop
waiting for server to shut down.... done
server stopped
$ exit
# exit
$ whoami
carl
$   



Great - so I'm good for getting the daemon going when I want it and for designating my own specific log files per session.  Now to create a user and get to work:



(with daemon running):



$ psql -U postgres
Password for user postgres:
psql (9.4.1)
Type "help" for help.

postgres=# CREATE ROLE carl SUPERUSER;
CREATE ROLE
postgres=# ALTER ROLE carl PASSWORD 'xxxxxxxx';
ALTER ROLE
postgres=# ALTER USER carl PASSWORD 'xxxxxxxx' LOGIN;
ALTER ROLE
postgres=# \q

$



I created a user/role carl with SUPERUSER capabilities within this instance of Postgresql.  It's a bit ugly and I'm not sure I've done this correctly or the easiest way.  Also, and of importance, I have given Postgresql user carl (not OpenBSD user carl) all permissions on everything.  Really, carl only needs permissions to work on the database he's working on.  Josh Drake (@linuxhiker on twitter) pointed this out to me.  I am grateful for this.  He is right.  I am lazy.



Now to create my database.  I got into model trains around Christmas of 2015 and went crazy collecting stuff and setting up a layout.  I needed to somehow keep track of all the cars before it all got too unwieldy.


$ psql postgres carl
Password for user carl:
psql (9.4.1)
Type "help" for help.

postgres=# CREATE DATABASE hotrains;
CREATE DATABASE
postgres=# \q
$  



The command line entry to start psql is something I'm a bit fuzzy on - as far as I can tell, postgres is the default database that initdb creates (the "copying template1 to postgres" line in the initdb output above), and it gives you something to connect to when you don't yet have a database of your own to work on.



I'm not going to post the full database code for the sake of brevity - it's only 11 tables but that's a bit much for a blog post.  Instead I'll post a graphic schema I made and talk to it a little bit before posting one related SQL code snippet. 


Disclaimer:  I'm not a designer.  This schema diagram I did with Dia, a fairly lightweight Linux/UNIX desktop tool for flowcharts and stuff.  I've never met a color palette or font choice I could simply let be.  Asking me to do a flowchart with a lot of leeway on design is like leaving a two year old home alone with a Crayola 64 pack of crayons and the 300 year old family Bible - it can't end well.

All that said, I find schema diagrams helpful for showing relationships between tables and having an ugly one is better than none at all.  I've embedded an svg version of it below; hopefully it shows up on the planetpython feed:









The focus of my crude toy database design was the use of foreign keys to maintain consistency in naming things I want to track:  rail name for example.  I went with "Santa Fe" where I could have gone with (and probably should have) "ATSF."  It doesn't matter as long as it's consistent and I know what it means.



Years ago I was called in to do some work on a blasting database at the mine.  There weren't any constraints on the entry of names of blasting materials, but what could go wrong?  There were only three or four products with four digit designators and "None."  Well . . . it was a mess.  I didn't want to take any chances on having a situation like that again, even, or especially, if I was doing all of the data entry.  Foreign keys it was!



Here's a quick dump of the code I used to create the validsidenumbers table.  The idea is to make sure the rail line or company name is consistent in all the side number records (yes, I did actually purchase some identical rolling stock with the exact same side numbers - it's a long story):

hotrains=# CREATE TABLE validsidenumbers (
railnamex        varchar(50) REFERENCES validraillines (namex),
sidenumber       varchar(50),
comments         text,
PRIMARY KEY (railnamex, sidenumber)
);
CREATE TABLE
hotrains=# 



That REFERENCES keyword sees to it that I won't enter anything typo'd or goofy into that railnamex column.
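
To watch a REFERENCES constraint do its job without a Postgresql server handy, here is a rough sketch using the standard library sqlite3 module as a stand-in.  The table layout mirrors the one above, but the types are SQLite's, the sample rows are made up, and SQLite needs foreign key enforcement switched on explicitly.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# SQLite only enforces foreign keys when asked to.
conn.execute('PRAGMA foreign_keys = ON')
conn.execute('CREATE TABLE validraillines (namex TEXT PRIMARY KEY)')
conn.execute("""
    CREATE TABLE validsidenumbers (
        railnamex  TEXT REFERENCES validraillines (namex),
        sidenumber TEXT,
        comments   TEXT,
        PRIMARY KEY (railnamex, sidenumber))""")
conn.execute("INSERT INTO validraillines VALUES ('Santa Fe')")

# A rail name that is in validraillines goes in fine.
conn.execute("INSERT INTO validsidenumbers VALUES ('Santa Fe', '1234', '')")

# A typo'd rail name is rejected by the foreign key.
try:
    conn.execute("INSERT INTO validsidenumbers VALUES ('Santa Fee', '1234', '')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```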



Next post is a Python one about storing images of the train cars in the database and displaying them from within psql.

Thanks for stopping by.


MSSQL sqlcmd -> bcp csv dump -> Excel
2015-09-26T19:56:00.000-07:00
A couple months back I had a one-off assignment to dump some data from a vendor provided relational database to a csv file and then from there to Excel (essentially a fairly simple ETL - extract, transform, load exercise).  It was a little trickier than I had planned it.  Disclaimer:  this may not be the best approach, but it worked . . . at least twice . . . on two different computers and that was sufficient.



Background:



Database:  the relational database provided by the vendor is the back end to a graphic mine planning application.  It does a good  job of storing geologic and mine planning data, but requires a little work to extract the data via SQL queries.  



Weighted Averages:  specifically, the queries are required to do tonne-weighted averages and binning.  Two areas that I've worked in, mine planning and mineral processing (mineral processing could be considered a subset of metallurgy or chemical engineering), require a lot of work with weighted averages.  Many of the database programming examples online deal with retail and focus on sales in the form of sum of sales by location.  The weighted average by tonnes or gallons of flow requires a bit more SQL code.



Breaking Up the SQL and the CSV Dump Problem:  in order to break the weighted average and any associated binning into smaller, manageable chunks of functionality, I used MSSQL (Microsoft SQL Server) global temporary tables in my queries.  Having my final result set in one of these global temporary tables allowed me to dump it to a csv file using the MSSQL bcp utility.  There are other ways to get a result set and produce a csv file from it with Python.  I wanted to isolate as much functionality within the MSSQL database as possible.  Also, the bcp utility gives some feedback when it fails - this made debugging or troubleshooting the one off script easier, for me, at least.
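For comparison, this is roughly what the "do it from Python" route could look like.  It is a sketch only:  sqlite3 stands in for the MSSQL connection (an assumption - I used bcp, not this), and resultsettocsv is a name made up for the example.

```python
import csv
import io
import sqlite3


def resultsettocsv(conn, query):
    """
    Run query against conn and return the
    result set as csv text with a header row.
    """
    cursor = conn.execute(query)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    # Column names come from the cursor description.
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor)
    return buffer.getvalue()


# Two rows borrowed from the mock tonnes table further down.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE tonnes (cutid INTEGER, tonnes REAL)')
conn.executemany('INSERT INTO tonnes VALUES (?, ?)',
                 [(1, 28437.0), (2, 13296.0)])
csvtext = resultsettocsv(
    conn, 'SELECT cutid, tonnes FROM tonnes ORDER BY cutid')
print(csvtext)
```

The trade-off is that the csv formatting moves out of the database and into the script, which is exactly what I was trying to avoid.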



As far as the SQL goes, I may have been able to do this with a single query without too much trouble.  There are tools within Transact-SQL for pivoting data and doing the sort of things I naively and crudely do with temporary tables.  That said, in real life, the data are seldom this simple and this clean.  There are far more permutations and exceptions.  The real life version of this problem has fourteen temporary tables versus the four shown here.



Sanitized Mock Up Scenario:  there's no need to go into depth on our vendor's database schema or the specific technical problem - both are a tad complicated.  I like doing tonne-weighted averages with code but it's not everyone's cup of tea.  In the interest of simplifying this whole thing and making it more fun, I've based it on the old Star Trek episode "The Devil in the Dark," about an underground mine on a distant planet.


















Mock Data:  we're modeling mined out areas and associated tonnages of rock bearing pergium, gold, and platinum in economic concentrations.  (I don't know what pergium is, but it was worth enough that going to war with Mother Horta seemed like a good idea).  Here is some code to create the tables and fill in the data (highly simplified schema - each mined out area is a "cut").


SQL Server 2008 R2 (Express) - table creation and mock data SQL code.  I'm not showing the autogenerated db creation code - it's lengthy - suffice it to say the database name is JanusVIPergiumMine.  Also, there are no keys in the tables for the sake of simplicity.



USE JanusVIPergiumMine;





CREATE TABLE cuts (
    cutid INT,
    cutname VARCHAR(50),
    monthx VARCHAR(30),
    yearx INT);




CREATE TABLE cutattributes (
    cutid INT,
    attributex VARCHAR(50),
    valuex VARCHAR(50));




CREATE TABLE tonnes(
    cutid INT NULL,
    tonnes FLOAT);




CREATE TABLE dbo.gradesx(
 cutid int NULL,
 gradename varchar(50) NULL,
 gradex float NULL);




DELETE FROM cuts;




INSERT INTO cuts
    VALUES (1, 'HappyPergium1', 'April', 2015),
           (2, 'HappyPergium12', 'April', 2015),
           (3, 'VaultofTomorrow1', 'April', 2015),
           (4, 'VaultofTomorrow2', 'April', 2015),
           (5, 'Children1', 'April', 2015),
           (6, 'Children2', 'April', 2015),
           (7, 'VandenbergsFind1', 'April', 2015),
           (8, 'VandenbergsFind2', 'April', 2015);




DELETE FROM cutattributes;




INSERT INTO cutattributes
    VALUES (1, 'Drift', 'Level23East'),
           (2, 'Drift', 'Level23East'),
           (3, 'Drift', 'Level23West'),
           (4, 'Drift', 'Level23West'),
           (5, 'Drift', 'BabyHortasCutEast'),
           (6, 'Drift', 'BabyHortasCutEast'),
           (7, 'Drift', 'BabyHortasCutWest'),
           (8, 'Drift', 'BabyHortasCutWest');




DELETE FROM tonnes;




INSERT INTO tonnes
    VALUES (1, 28437.0),
           (2, 13296.0),
           (3, 13222.0),
           (4, 6473.0),
           (5, 6744.0),
           (6, 8729.0),
           (7, 10030.0),
           (8, 2345.0);




DELETE FROM gradesx;




INSERT INTO gradesx
    VALUES (1, 'Au g/tonne', 6.44),
           (1, 'Pt g/tonne', 0.54),
           (1, 'Pergium g/tonne', 15.23),
           (2, 'Au g/tonne', 7.83),
           (2, 'Pt g/tonne', 0.77),
           (2, 'Pergium g/tonne', 4.22),
           (3, 'Au g/tonne', 0.44),
           (3, 'Pt g/tonne', 3.54),
           (3, 'Pergium g/tonne', 2.72),
           (4, 'Au g/tonne', 0.87),
           (4, 'Pt g/tonne', 2.87),
           (4, 'Pergium g/tonne', 1.11),
           (5, 'Au g/tonne', 12.03),
           (5, 'Pt g/tonne', 0.33),
           (5, 'Pergium g/tonne', 10.01),
           (6, 'Au g/tonne', 8.72),
           (6, 'Pt g/tonne', 1.38),
           (6, 'Pergium g/tonne', 5.44),
           (7, 'Au g/tonne', 7.37),
           (7, 'Pt g/tonne', 1.59),
           (7, 'Pergium g/tonne', 4.05),
           (8, 'Au g/tonne', 3.33),
           (8, 'Pt g/tonne', 0.98),
           (8, 'Pergium g/tonne', 3.99);




Python Code to Run the Dump/ETL to CSV:  this is essentially a Python subprocess feeding commands to MSSQL's sqlcmd, with bcp called from inside the sqlcmd session.  What made this particularly brittle and hairy is the manner in which the lifetime of temporary tables is determined in MSSQL.  To get the temporary table with my results to persist, I had to wrap its creation inside a process.  I'm ignorant as to the internal workings of buffers and memory here, but the MSSQL sqlcmd commands do not execute or write to disk exactly when you might expect them to.  Nothing is really completed until the process hosting sqlcmd is killed.
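One possible explanation for the late execution (an assumption on my part, not something I verified against sqlcmd) is that Python buffers what gets written to the child's stdin pipe until the pipe is flushed or closed.  The sketch below uses a small child Python process as a stand-in for sqlcmd, flushes after every command, and then closes stdin and waits for the child to finish.

```python
import subprocess
import sys

# Stand-in for sqlcmd (hypothetical): a child Python process
# that upper-cases every line it reads from stdin.
CHILD = [sys.executable, '-c',
         'import sys\n'
         'for line in sys.stdin:\n'
         '    sys.stdout.write(line.upper())\n']


def runcommands(commands):
    """
    Feed commands to the child one at a time,
    flushing after each, then close stdin and
    wait for the child to finish.

    Returns the child's output as a string.
    """
    proc = subprocess.Popen(CHILD, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    for cmdx in commands:
        proc.stdin.write(bytes(cmdx + '\n', 'utf-8'))
        # Push the bytes through the pipe now instead of at exit.
        proc.stdin.flush()
    # Closing stdin signals end of input to the child.
    proc.stdin.close()
    output = proc.stdout.read()
    proc.wait()
    proc.stdout.close()
    return output.decode('utf-8')


result = runcommands(['create table', 'go'])
print(result)
```

If buffering is the culprit, a flush after each write in the real script might make the sleep calls unnecessary; I have not tested that against sqlcmd.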




At work I actually got the bcp format file generated on the fly - I wasn't able to reproduce this behavior for this mock exercise.  Instead, I generated a bcp format file for the target table dump "by hand" and put the file in my working directory.




As I show further on, this SQL data dump will be run from a button within an Excel spreadsheet.




Mr. Spock, or better said, Horta Mother says it best:



Subprocesses, sqlcmd, bcp, Excel . . .

PAAAAAIIIIIIIN!












#!C:\Python34\python




# blogsqlcmdpull.py




# XXX
# Changed my laptop's name to MYLAPTOP.

# Yours will be whatever your computer

# name is.




import os
import subprocess as subx
import shlex
import time

import argparse




# Need to make sure you are in proper Windows directory.
# Can vary from machine to machine based on
# environment variables.

# Googled StackOverflow.
# 5137497/find-current-directory-and-files-directory
EXCELDIR = os.path.dirname(os.path.realpath(__file__))
os.chdir(EXCELDIR)
print('\nCurrent directory is {:s}'.format(os.getcwd()))




parser = argparse.ArgumentParser()
# 7 character argument like 'Apr2015'
# Feed in at command line
parser.add_argument('monthyear',
    help='seven characters: month abbreviation plus year (Apr2015)',
    type=str)
args = parser.parse_args()
MONTHYEAR = args.monthyear




# Use Peoplesoft/company id so that more than
# one user can run this at once if necessary
# (note:  will not work if one user tries to
#         run multiple instances at the same
#         time - theoretically <not tested>
#         tables will get mangled and data
#         will be corrupt.)
USER = os.getlogin()




CSVDUMPNAME = 'csvdumpname'
CSVDUMP = 'nohandjamovnumbersbcp'

CSVEXT = '.csv'

HOMESERVERNAME = 'homeservername'
LOCALSERVER = r'MYLAPTOP\SQLEXPRESS'

USERNAME = 'username'




# Need to fill in month, year
# with input from Excel spreadsheet.
QUERYDICT = {'month':"'{:s}'",
             'year':0,
             USERNAME:USER}




# For sqlcmd and bcp
ERRORFILENAME = 'errorfilename'
STDOUTFILENAME = 'stdoutfilename'
ERRX = 'sqlcmderroutput.txt'
STDOUTX = 'sqcmdoutput.txt'
EXIT = '\nexit\n'
UTF8 = 'utf-8'
GOX = '\nGO\n'




# 2 second pause.
PAUSEX = 2

SLEEPING = '\nsleeping {pause:d} seconds . . .\n'




# XXX - Had to generate this bcp format file
#       from table in MSSQL Management Studio -
#       dos command line:
# bcp ##TARGETX format nul -f test.fmt -S MYLAPTOP\SQLEXPRESS -t , -c -T


# XXX - you can programmatically extract
#       column names from the bcp format
#       file or
#       you can dump them from SQLServer
#       with a separate query in bcp - 
#       I have done neither here
#       (I hardcoded them).
FMTFILE = 'formatfile'
COLBCPFMTFILE = 'bcp.fmt'




CMDLINEDICT = {HOMESERVERNAME:LOCALSERVER,
               'exit':EXIT,
               CSVDUMPNAME:CSVDUMP,
               ERRORFILENAME:ERRX,
               STDOUTFILENAME:STDOUTX,
               'go':GOX,
               USERNAME:USER,
               'pause':PAUSEX,
               FMTFILE:COLBCPFMTFILE}




# Startup for sqlcmd interactive mode.
SQLPATH = r'C:\Program Files\Microsoft SQL Server'
SQLPATH += r'\100\Tools\Binn\SQLCMD.exe'
SQLCMDEXE = [SQLPATH]
SQLCMDARGS = shlex.split(
    ('-S{homeservername:s}'.format(**CMDLINEDICT)),
    posix=False)
SQLCMDEXE.extend(SQLCMDARGS)




BCPSTR = ':!!bcp "SELECT * FROM ##TARGETX{username:s};" '
BCPSTR += 'queryout {csvdumpname:s}.csv -t , '
BCPSTR += '-f {formatfile:s} -S {homeservername:s} -T'
BCPSTR = BCPSTR.format(**CMDLINEDICT)




def cleanslate():
    """
    Delete files from previous runs.
    """
    # XXX - only one file right now.
    files = [CSVDUMP + CSVEXT]
    for filex in files:
        if os.path.exists(filex) and os.path.isfile(filex):
            os.remove(filex)
    return 0




MONTHS = {'Jan':'January',
          'Feb':'February',
          'Mar':'March',
          'Apr':'April',
          'May':'May',
          'Jun':'June',
          'Jul':'July',
          'Aug':'August',
          'Sep':'September',
          'Oct':'October',
          'Nov':'November',
          'Dec':'December'}




def parseworkbookname():
    """
    Get month (string) and year (integer)
    from name of workbook (Apr2015).

    Return as month, year 2 tuple.
    """
    # XXX
    # Write this out - will eventually 
    # need error checking/try-catch
    monthx = MONTHS[MONTHYEAR[:3]]
    yearx = int(MONTHYEAR[3:])
    return monthx, yearx




# Global Temporary Tables
TONNESTEMPTBL = """
CREATE TABLE ##TONNES{username:s} (
    yearx INT,
    monthx VARCHAR(30),
    cutid INTEGER,
    drift VARCHAR(30),
    tonnes FLOAT);
"""




FILLTONNES = """
USE JanusVIPergiumMine;




DECLARE @DRIFT CHAR(5) = 'Drift';

INSERT INTO ##TONNES{username:s}
    SELECT cutx.yearx,
           cutx.monthx,
           cutx.cutid,
           cutattrx.valuex AS drift,
           tonnesx.tonnes
    FROM cuts cutx
        INNER JOIN cutattributes cutattrx
            ON cutx.cutid = cutattrx.cutid
        INNER JOIN tonnes tonnesx
            ON cutx.cutid = tonnesx.cutid
    WHERE cutx.yearx = {year:d} AND
          cutx.monthx = {month:s} AND
          cutattrx.attributex = @DRIFT;
"""




GRADESTEMPTBL = """
CREATE TABLE ##GRADES{username:s} (
    cutid INTEGER,
    drift VARCHAR(30),
    gradenamex VARCHAR(50),
    graden FLOAT);
"""




FILLGRADES = """
USE JanusVIPergiumMine; 

DECLARE @DRIFT CHAR(5) = 'Drift';

INSERT INTO ##GRADES{username:s}
    SELECT cutx.cutid,
           cutattrx.valuex AS drift,
           gradesx.gradename,
           gradesx.gradex
    FROM cuts cutx
        INNER JOIN cutattributes cutattrx
            ON cutx.cutid = cutattrx.cutid
        INNER JOIN gradesx
            ON cutx.cutid = gradesx.cutid
    WHERE cutx.yearx = {year:d} AND
          cutx.monthx = {month:s} AND
          cutattrx.attributex = @DRIFT;
"""




# Sum and tonne-weighted averages
MONTHLYPRODDATASETTEMPTBL = """
CREATE TABLE ##MONTHLYPRODDATASET{username:s} (
    yearx INT,
    monthx VARCHAR(30),
    drift VARCHAR(30),
    tonnes FLOAT,
    gradename VARCHAR(50),
    grade FLOAT);
"""




FILLMONTHLYPRODDATASET = """
INSERT INTO ##MONTHLYPRODDATASET{username:s}
    SELECT tonnesx.yearx,
           tonnesx.monthx,
           tonnesx.drift,
           SUM(tonnesx.tonnes) AS tonnes,
           gradesx.gradenamex AS gradename,
           SUM(tonnesx.tonnes * gradesx.graden)/
           SUM(tonnesx.tonnes) AS graden
    FROM ##TONNES{username:s} tonnesx
        INNER JOIN ##GRADES{username:s} gradesx
            ON tonnesx.cutid = gradesx.cutid
    GROUP BY tonnesx.yearx,
             tonnesx.monthx,
             tonnesx.drift,
             gradesx.gradenamex;
"""




# Pivot
TARGETXTEMPTBL = """
CREATE TABLE ##TARGETX{username:s} (
    yearx INT,
    monthx VARCHAR(30),
    drift VARCHAR(30),
    tonnes FLOAT,
    pergium FLOAT,
    Au FLOAT,
    Pt FLOAT);
"""




FILLTARGETX = """
DECLARE @PERGIUM CHAR(15) = 'Pergium g/tonne';
DECLARE @GOLD CHAR(10) = 'Au g/tonne';
DECLARE @PLATINUM CHAR(10) = 'Pt g/tonne';

INSERT INTO ##TARGETX{username:s}
    SELECT mpds.yearx,
           mpds.monthx,
           mpds.drift,
           MAX(mpds.tonnes) AS tonnes,
           MAX(perg.grade) AS pergium,
           MAX(au.grade) AS Au,
           MAX(pt.grade) AS Pt
    FROM ##MONTHLYPRODDATASET{username:s} mpds
        INNER JOIN ##MONTHLYPRODDATASET{username:s} perg
            ON perg.drift = mpds.drift AND
            perg.gradename = @PERGIUM
        INNER JOIN ##MONTHLYPRODDATASET{username:s} au
            ON au.drift = mpds.drift AND
            au.gradename = @GOLD
        INNER JOIN ##MONTHLYPRODDATASET{username:s} pt
            ON pt.drift = mpds.drift AND
            pt.gradename = @PLATINUM
    GROUP BY mpds.yearx,
             mpds.monthx,
             mpds.drift
    ORDER BY mpds.drift;
"""




# 1) Create global temp tables.
# 2) Fill global temp tables.
# 3) Get desired result set into the target global temp table.
# 4) Run bcp against target global temp table.
# 5) Drop global temp tables.
CREATETABLES = {1:TONNESTEMPTBL,
                2:GRADESTEMPTBL,
                3:MONTHLYPRODDATASETTEMPTBL,
                4:TARGETXTEMPTBL}

FILLTABLES = {1:FILLTONNES,
              2:FILLGRADES,
              3:FILLMONTHLYPRODDATASET,
              4:FILLTARGETX}




def getdataincsvformat():
    """
    Retrieve data from MSSQL server.
    Dump into csv text file.
    """
    numtables = len(CREATETABLES)
    with open('{errorfilename:s}'.format(**CMDLINEDICT), 'w') as e:
        with open('{stdoutfilename:s}'.format(**CMDLINEDICT), 'w') as f:
            sqlcmdproc = subx.Popen(SQLCMDEXE, stdin=subx.PIPE,
                    stdout=f, stderr=e)
            for i in range(numtables):
                cmdx = (CREATETABLES[i + 1]).format(**QUERYDICT)
                print(cmdx)
                sqlcmdproc.stdin.write(bytes(cmdx +
                    '{go:s}'.format(**CMDLINEDICT), UTF8))
                print(SLEEPING.format(**CMDLINEDICT))
                time.sleep(PAUSEX)
            for i in range(numtables):
                cmdx = (FILLTABLES[i + 1]).format(**QUERYDICT)
                print(cmdx)
                sqlcmdproc.stdin.write(bytes(cmdx +
                    '{go:s}'.format(**CMDLINEDICT), UTF8))
                print(SLEEPING.format(**CMDLINEDICT))
                time.sleep(PAUSEX)
            print('bcp csv dump command (from inside sqlcmd) . . .')
            sqlcmdproc.stdin.write(bytes(BCPSTR, UTF8))
            print(SLEEPING.format(**CMDLINEDICT))
            time.sleep(PAUSEX)
            sqlcmdproc.stdin.write(bytes('{exit:s}'.format(**CMDLINEDICT), UTF8))
    return 0

          

monthx, yearx = parseworkbookname()


# Get rid of previous files.
print('\ndeleting files from previous runs . . .\n')
cleanslate()


# Get month and year into query dictionary.
QUERYDICT['month'] = QUERYDICT['month'].format(monthx)
QUERYDICT['year'] = yearx


getdataincsvformat()


print('done')




It's ugly, but it works.
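The XXX note in parseworkbookname says error checking will be needed eventually.  A minimal sketch of what that might look like follows; parsemonthyear is a hypothetical name, and raising ValueError is just one reasonable choice.

```python
MONTHS = {'Jan': 'January', 'Feb': 'February', 'Mar': 'March',
          'Apr': 'April', 'May': 'May', 'Jun': 'June',
          'Jul': 'July', 'Aug': 'August', 'Sep': 'September',
          'Oct': 'October', 'Nov': 'November', 'Dec': 'December'}


def parsemonthyear(monthyear):
    """
    Split a string like 'Apr2015' into a
    (month name, year) 2 tuple.

    Raises ValueError with a readable message
    if the string does not fit the pattern.
    """
    abbrev, yearstr = monthyear[:3], monthyear[3:]
    if abbrev not in MONTHS:
        raise ValueError(
            'unknown month abbreviation: {!r}'.format(abbrev))
    if len(yearstr) != 4 or not yearstr.isdigit():
        raise ValueError(
            'expected a four digit year, got {!r}'.format(yearstr))
    return MONTHS[abbrev], int(yearstr)


print(parsemonthyear('Apr2015'))
```

A worksheet named something other than the month and year would then fail with a message instead of a bare KeyError.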

Keeping with the Horta theme, this would be a good spot for an image break:






Damnit, Jim, I'm a geologist not a database programmer.

You're an analyst, analyze.




Load to Excel:  this is fairly straightforward - COM programming with Mark Hammond and company's venerable win32com.  The only working version of the win32com library I had on my laptop on which I am writing this blog entry was for a Python 2.5 release that came with an old version of our mine planning software (MineSight/Hexagon) - the show must go on!




#!C:\MineSight\mpython



# blognohandjamnumberspython2.5.py



# mpython is Python 2.5 on this machine.

# Had to remove collections.namedtuple
# (used dictionary instead) and new
# string formatting (reverted to use
# of percent sign for string interpolation).

# Lastly, did not have argparse at my

# disposal.



from __future__ import with_statement



"""
Get numbers into spreadsheet
without having to hand jam
everything.
"""



import os
from win32com.client import Dispatch

# Plan on receiving Excel file's
# path from call from Excel workbook.


import sys



# Path to Excel workbook.
WB = sys.argv[1]
# Worksheet name.
WSNAME = sys.argv[2]



BACKSLASH = '\\'


# Looking for data file in current directory.
# (same directory as Python script)
CSVDUMP = 'nohandjamovnumbersbcp.csv'


# XXX - repeated code from data dump file.
CURDIR = os.path.dirname(os.path.realpath(__file__))
os.chdir(CURDIR)
print('\nCurrent directory is %s' % os.getcwd())


# Build the path to the csv dump with os.path.join
# rather than concatenating with a backslash by hand.
CSVPATH = os.path.join(CURDIR, CSVDUMP)



# Fields in csv dump.
YEARX = 'yearx'
MONTHX = 'monthx'
DRIFT = 'drift'
TONNES = 'tonnes'
PERGIUM = 'pergium'
GOLD = 'Au'
PLATINUM = 'Pt'



FIELDS = [YEARX,
          MONTHX,
          DRIFT,
          TONNES,
          PERGIUM,
          GOLD,
          PLATINUM]



# Excel cells.
# Map this to csv dump and brute force cycle to fill in.
ROWCOL = '%s%d'

COLUMNMAP = dict((namex, colx) for namex, colx in
        zip(FIELDS, ['A', 'B', 'C', 'D',
            'E', 'F', 'G']))



EXCELX = 'Excel.Application'



def getcsvdata():
    """
    Puts csv data (CMP dump) into
    a list of data structures
    and returns list.
    """
    with open(CSVPATH, 'r') as f:
        records = []
        for linex in f:
            # XXX - print for debugging/information
            print([n.strip() for n in linex.split(',')])
            records.append(dict(zip(FIELDS,
                (n.strip() for n
                    in linex.split(',')))))
    return records



# Put Excel stuff here.
def getworkbook(workbooks):
    """
    Get handle to desired workbook
    """
    for x in workbooks:
        print(x.FullName)
        if x.FullName == WB:
            # XXX - debug/information print statement
            print('EUREKA')
            break
    return x



def fillinspreadsheet(records):
    """
    Fill in numbers in spreadsheet.

    Side effect function.

    records is a list of dictionaries.
    """
    excelx = Dispatch(EXCELX)
    wb = getworkbook(excelx.Workbooks)
    ws = wb.Worksheets.Item(WSNAME)
    # Start entering data at row 4.
    row = 4
    for recordx in records:
        for x in FIELDS:
            column = COLUMNMAP[x]
            valuex = recordx[x]
            cellx = ws.Range(ROWCOL % (column, row))
            # Selection makes pasting of new value visible.
            # I like this - not everyone does.  YMMV
            cellx.Select()
            cellx.Value = valuex
        # On to the next record on the next row.
        row += 1
    # Come back to origin of worksheet at end.
    ws.Range('A1').Select()
    return 0
                
cmprecords = getcsvdata()
fillinspreadsheet(cmprecords)

print('done')



On to the VBA code inside the Excel spreadsheet (macros) that execute the Python code:

Option Explicit



Const EXECX = "C:\Python34\python "
Const EXECXII = "C:\MineSight\mpython\python\2.5\python "
Const EXCELSCRIPT = "blognohandjamnumberspython2.5.py "
Const SQLSCRIPT = "blogsqlcmdpull.py "



Sub FillInNumbers()



    Dim namex As String
    Dim wb As Workbook
    Dim ws As Worksheet
    
    Dim longexecstr As String
    
    Set ws = Selection.Worksheet
    'Try to get current worksheet name to feed values to query.
    namex = ws.Name
    
    longexecstr = EXECXII & " " & ActiveWorkbook.Path
    longexecstr = longexecstr & Chr(92) & EXCELSCRIPT
    longexecstr = longexecstr & ActiveWorkbook.Path & Chr(92) & ActiveWorkbook.Name
    longexecstr = longexecstr & " " & namex

    VBA.Interaction.Shell longexecstr, vbNormalFocus
    
End Sub



Sub GetSQLData()

    Dim namex As String
    Dim ws As Worksheet
    
    Set ws = Selection.Worksheet
    'Try to get current worksheet name to feed values to query.
    namex = ws.Name

    VBA.Interaction.Shell EXECX & ActiveWorkbook.Path & _
        Chr(92) & SQLSCRIPT & namex, vbNormalFocus
    
End Sub



I always use Option Explicit in my VBA code - that's not particularly pythonic, but being pythonic inside the VBA interpreter can be hazardous.  As always, YMMV.

Lastly, a rough demo and a data check.  We'll run the SQL dump from the top button on the Excel worksheet:













And now we'll run the lower button to put the data into the spreadsheet.  It's probably worth noting here that I did not bother doing any type conversions on the text coming out of the SQL csv dump in my Python code.  That's because Excel handles that for you.  It's not free software (Excel/Office) - might as well get your money's worth.






We'll do a check on the first row for tonnes and a pergium grade.  Going back to our original data:



Cuts 1 and 2 belong to the drift Level23East.



Tonnes:



VALUES (1, 28437.0),
       (2, 13296.0),



Total:  41733



Looks good, we know we got a sum of tonnes right.  Now the tonne-weighted average:



Pergium:



(1, 'Pergium g/tonne', 15.23),
(2, 'Pergium g/tonne', 4.22),



(28437 * 15.23 + 13296 * 4.22)/41733 = 11.722



It checks out.  Do a few more checks and send it out to the Janus VI Pergium Mine mine manager.
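The same hand check can be scripted.  This is a small sketch using the mock numbers for cuts 1 and 2 from above; weightedaverage is a name made up for the example and mirrors the SUM(tonnes * grade)/SUM(tonnes) expression in the SQL.

```python
# Mock data for cuts 1 and 2 (drift Level23East), copied from above.
TONNES = {1: 28437.0, 2: 13296.0}
GRADES = {'Au g/tonne': {1: 6.44, 2: 7.83},
          'Pt g/tonne': {1: 0.54, 2: 0.77},
          'Pergium g/tonne': {1: 15.23, 2: 4.22}}


def weightedaverage(tonnes, grades):
    """
    Tonne-weighted average grade:
    sum(tonnes * grade) / sum(tonnes).
    """
    totaltonnes = sum(tonnes.values())
    metal = sum(tonnes[cutid] * grades[cutid] for cutid in tonnes)
    return metal / totaltonnes


print('tonnes: {:.0f}'.format(sum(TONNES.values())))
for gradename in sorted(GRADES):
    print('{:s}: {:.3f}'.format(
        gradename, weightedaverage(TONNES, GRADES[gradename])))
```

The pergium line should agree with the 11.722 worked out by hand; the gold and platinum lines give two more numbers to compare against the spreadsheet.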

Notes:



This is a messy one-off mousetrap.  That said, this is often how the sausage gets made in a non-programming, non-professional development environment.  We do have an in-house Python developer, Lori.  Often she's given something like this and told to clean it up and make it into an in-house app.  That's challenging.  Ideally, the mining professional writing the one-off and the dev get together and cross-educate vis a vis the domain space (mining) and the developer space (programming, good software design and practice).  It's a lot of fun but the first go around is seldom pretty.



Thanks for stopping by.











Leonard Nimoy

1931 - 2015


Lenovo Thinkpad X201 Fan Replacement
2015-05-17T08:34:00.000-07:00
This is not a Python-related post per se, but it may be useful to people getting started with UNIX-based, open source software, or even a Windows user who happens to be using a Thinkpad X201 laptop.

Background:



1) I use OpenBSD as my operating system because I am striving to learn UNIX and I find that distro the best for me for that purpose.















2) The venerable legacy IBM/current Lenovo Thinkpad line of laptops tends to be one of the best supported by OpenBSD and other BSD development communities (small, but loyal dev and user base).



3) I buy my Thinkpads refurb'd because they're cheaper that way.



4) Laptop parts only last so long before they start failing, more so with refurbished ones.  It was easy to replace the hard drive; the fan is a bit more complicated in terms of disassembling the laptop.



5) I'm a bit mechanically challenged and tend to break things permanently when trying to fix them.  This post hopefully will serve to help others overcome this lack of confidence and fear.



There is actually a really good step by step still frame photo series on the web about how to take apart a Thinkpad X201 and replace the fan.  I used that extensively during this task:

http://www.myfixguide.com/manual/lenovo-thinkpad-x201-disassembly-clean-cooling-fan-remove-keyboard/



A walkthrough of my experience and a few notes:











Mr. Dexter's Star Wars joke about Phillips head screws (actually bolts) notwithstanding, stripping those little guys is a problem.  I was lucky this time.  In my model train adventures, I've been less so.

There are few things more annoying than a deep seated, little Phillips head bolt or screw.  The thought of taking a power drill to a laptop to extract one of these makes me a bit nervous.  Fortunately, just before RadioShack went bankrupt a few months back, I found a nice set of long shaft Phillips head screwdrivers in Tucson in one of their stores.  Those tools have been indispensable.

















This is the laptop after I got the fan hooked up.  There is a forum on the internet from a few years back where someone is asking how to test the fan.  People just kept trolling him and laughing at him.  Here is how it's done.  Basically you hook everything up (in my case I only needed the power, screen, and keyboard) without actually putting the laptop back together (those zillion Phillips head bolts!) and boot up.  It's hard to see, but the fan is happily whirring away over there on the left.

It's weird operating on a machine you're used to having in one piece - reminiscent of those scenes in STTNG where they take apart LCDR Data ("Data and Commander Riker are in engineering examining Data's head.")

Heating up:






















You don't want to test the computer too long in this state (without the fan in the proper place and the machine put back together).  Thinkpads and the X201 in particular are notorious for running hot.  You can see from the screen that the machine is heating up at about a degree Centigrade in the time it takes me to type in the next sysctl command/query.

sudo shutdown -hp now

I put it all back together and the only (well, not really - see below) thing different was a slight nick on the keyboard:










Hey!  Where did these "extra" screws (bolts) come from?!  Uh-oh . . .





. . . and the sound doesn't work either - looks like I was a bit too hasty in putting this thing back together.  We'll give it another try . . .





. . . it looks like some of those extra bolts hold that important piece of aluminum in place . . .













. . . and a few more over here . . .





. . . uh, OK, that's my problem with the sound :-(

That sound card connection is paper thin - I think that's why it has to be secured with a little snap-in thingy in the picture.  Hardware is pretty amazing sometimes.  To all my electrical engineering friends:  I bow to you.











This time I test the laptop again, but for sound.  Doing what computer techs do (taking apart laptops and fixing them)



I tell you folks

It's harder than it looks



(Sorry, had to).

After I got all the screws (bolts) put back in (save 4 - I have no idea and I'm leaving good enough alone), I still had (or thought I had) a problem with the sound.  It turns out I've been through this before.  On UNIX-based systems the X201 mute key works funkily.  This link explains it a bit for a Debian Linux system.  Whether OpenBSD is different under the hood or not, the behavior to the user is essentially the same:


http://www.stderr.nl/Blog/Hardware/Thinkpad/WeirdMuteButtonBehaviour.html



And I was good to go!

Hope this helps someone like me.

Thanks for stopping by.













IE and Getting a Text File Off the Web - Selenium Web Tools
2015-03-26T09:41:00.002-07:00
I've blogged previously about getting information off of a distant server on my employer's internal SharePoint site.  Automating this can be a little challenging, especially when there's a change.

My new desktop showed up with Internet Explorer 11 and Windows 7 Enterprise.  When I went to run my MineSight multirun (basically a batch file with a GUI front end that our mine planning vendor provides) the file fetch from our SharePoint site didn't work.  A little googling led me to Selenium.

As is often the case, I am wayyyy late to the party here.  I remember Selenium from Pycon 2010 in Atlanta because they gave us a nice mug with new string formatting on it that I use frequently (both the mug and the formatting):






 

I was at Pycon 2010 . . . and I have the mug to prove it.



 

My project manager/boss at the time, Eric, seeing me gush over the string formatting commands, did his usual button-pushing exercise by commenting, "I don't know; why didn't they put something on there like 'from pot import coffee'?"  People, y'know?

Back to Selenium - I was able to get what I needed from it with some research and downloading.  The steps are basically:



 

    1) Download IEDriverServer.exe

 

    2) Put the executable in a location in your path.

 

    3) Download Python Selenium Bindings and follow the install instructions.  I went the Python 3.4 route (versus the Python 2.7 that comes with MineSight) - personal preference on my part.

    4) Make sure your Internet Explorer environment/application is set up in a way that won't cause you problems.  I could try to describe this, but this blog post from a Selenium developer does it so much better (complete with screenshots):  http://jimevansmusic.blogspot.com/2012/08/youre-doing-it-wrong-protected-mode-and.html.  When Microsoft talks about "zones" and IE Protected Mode, the zones refer to things like "Trusted Sites," company web, external internet, etc. - all of those have to have the same Protected Mode setting (all on or all off) or things won't work and you'll get a fairly cryptic error message when the script crashes.



For my example, I was able to comment out some of the things I need to do within the MineSight multirun.  The DOS window hangs and IEDriverServer stays open within the MineSight multirun and app - I hacked this problem by killing it with an os.system() call.  Whatever it takes.

 

I couldn't efficiently get the script to recognize HTML tag names, so I hacked that with text processing.  This is bad, but effective.
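A possibly cleaner route for both hacks, sketched under assumptions:  Selenium drivers have a quit() method which, per the Selenium documentation, ends the session and shuts down the driver executable (close() only closes the window), and str.partition can do the tag slicing.  I have not tried this inside the multirun.  FakeDriver is a made-up stand-in so the sketch runs without a browser; with Selenium installed, webdriver.Ie would be passed in its place.

```python
PRETAG = '<pre>'
PRETAGCLOSE = '</pre>'


def extractpre(pagesource):
    """
    Return the text between the first <pre> and
    the following </pre>, or '' if there is no <pre>.
    """
    _, found, rest = pagesource.partition(PRETAG)
    if not found:
        return ''
    return rest.partition(PRETAGCLOSE)[0]


def getbody(url, makedriver):
    """
    makedriver is any callable that returns a Selenium
    style driver object (webdriver.Ie, for example).
    """
    browser = makedriver()
    try:
        browser.get(url)
        return extractpre(browser.page_source)
    finally:
        # quit() rather than close(): ends the whole session
        # even if the page fetch raised an exception.
        browser.quit()


class FakeDriver(object):
    """Hypothetical stand-in for a real browser driver."""
    page_source = '<html><body><pre>line one\nline two</pre></body></html>'
    closed = False

    def get(self, url):
        pass

    def quit(self):
        FakeDriver.closed = True


print(getbody('http://example.invalid/README', FakeDriver))
```

Whether quit() is enough to make the DOS window go away inside the multirun is something I can't promise; the taskkill hack remains the fallback.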

 

The code:

 

#!C:\Python34\python

 

"""
Get text from site via Internet Explorer.
"""

 

INST = 'instructions.txt'

 

# For killing process inside Multirun.
# import os

 

from time import sleep as slpx

 

from selenium import webdriver

 

# XXX - hack - had difficulty getting
#       things by tag - text processed it.
PRETAG = '<pre>'
PRETAGLEN = len(PRETAG)


PRETAGCLOSE = '</pre>'

# Seconds to pause at end.
PAUSE = 3

INSTRUCTIONS = 'http://ftp3.usa.openbsd.org/pub/OpenBSD/5.6/README'
INSTR = 'instructions.txt'

 

# XXX - may not matter (\r versus \n), in all cases
#       but for numbers in multirun, makeshift chomp
#       processing made a difference.


RETCHAR = '\r'

 

# Hack to shutdown DOS window.

# TASKKILL = 'taskkill /im IEDriverServer.exe /F'

 

def getbody(url):
    """
    Given the website address (url),
    returns inner HTML text stripped of tags.
    """
    browser = webdriver.Ie()
    browser.get(url)
    text = browser.page_source
    browser.close()
    text = text[(text.index(PRETAG) + PRETAGLEN):]
    text = text[:(text.index(PRETAGCLOSE))]
    text = text.split(RETCHAR)
    # The pieces are returned as-is (not stripped) so that
    # the line feeds survive the ''.join() done by the caller.
    return text

 

textii = getbody(INSTRUCTIONS)
print('\nDealing with writing of instructions file . . .\n')
textii = ''.join(textii)
f = open(INSTR, 'w')
f.write(textii)
f.close()

print('Instructions copied.')

print('\nPausing {:d} seconds . . .\n'.format(PAUSE))
slpx(PAUSE)


# XXX - can't get window to close in Multirun (MXPERT) - CBT 23MAR2015
# os.system(TASKKILL)

 

 




Polygon Offset Using Vector Math in IronPython
2014-11-25T21:06:00.000-08:00


The other day I saw something retweeted by @leppie (I think) about an experimental hyper-fast vector math driven 3D engine for the dot Net Framework.  This led me to investigate whether there is a default implementation of vector math in the dot Net Framework.  As it turns out, there is.



This is of interest because (I think) this would make IronPython the only Python implementation that has vector math included without having to install a third party library.  Java has a java.util.Vector class, but it has nothing to do with vector math (it's a specialized, growable array).  You do need to use the dot Net Framework instead of standard Python modules, but if you're running IronPython, you should have access to that anyway.



The whole idea, or at least a big part of it, of running a Python implementation against the dotNet Framework is that you can leverage that big library collection with a language that's fairly dense, easy, and doesn't require compilation.



This was pretty easy on Windows.  The only confusing part is that the System.Windows namespace shows up in more than one dot Net assembly.  You want the one in the WindowsBase dll.  This is the one that has our Vector object in it.



The code (including the plotting by Gnuplot - I had to download the Windows version; I did leave out the monastery.py file with the original shape points in it; also, the writetofile.py file is almost exactly like the one from the previous post except that for a Vector object, the x and y names are capitalized):



# vecipy.py



"""
Polygon offset problem using
dot Net Framework.
"""

import clr

WINX = 'WindowsBase'

clr.AddReference(WINX)

from System.Windows import Vector

import math
import copy

import monastery as pic

OFFSET = 0.15

def scaleadd(origin, offset, vectorx):
    """
    From a Vector representing the origin,
    a scalar offset, and a Vector, returns
    a Vector object representing a point 
    offset from the origin.

    (Multiply vectorx by offset and add to origin.)
    """
    # Multiply method that takes scalar and Vector.
    multx = Vector.Multiply(vectorx, offset)
    return Vector.Add(multx, origin)

def getinsetpoint(pt1, pt2, pt3):
    """
    Given three points that form a corner (pt1, pt2, pt3),
    returns a point offset distance OFFSET to the right
    of the path formed by pt1-pt2-pt3.
    
    pt1, pt2, and pt3 are two-tuples.
    
    Returns a Vector object.
    """
    origin = Vector(*pt2)
    v1 = Vector(pt1[0] - pt2[0], pt1[1] - pt2[1])
    v1.Normalize()
    
    v2 = Vector(pt3[0] - pt2[0], pt3[1] - pt2[1])
    v2.Normalize()
    
    v3 = copy.copy(v1)

    v1 = Vector.CrossProduct(v1, v2)

    v3 = Vector.Add(v3, v2)
    v3.Normalize()
    
    # In dotNet - Vector.Multiply is overloaded.
    # When it gets two Vector objects as arguments
    # it returns a dot product.
    cs = Vector.Multiply(v3, v2)
    
    # Again multiplication is overloaded.
    # Here it gets a scalar and a Vector
    # as arguments.
    a1 = Vector.Multiply(cs, v2)
    a2 = Vector.Subtract(v3, a1)
    
    if cs > 0:
        alpha = math.sqrt(a2.LengthSquared)
    else:
        alpha = -math.sqrt(a2.LengthSquared)
    
    if v1 < 0.0:
        return scaleadd(origin, -1.0 * OFFSET/alpha, v3)
    else:
        return scaleadd(origin, OFFSET/alpha, v3)

def generatepoints():
    """
    Create list of offset points
    for points inset from polygon.

    Return list.
    """
    polyinset = []
    lenpolygon = len(pic.MONASTERY)
    i = 0
    poly = pic.MONASTERY
    while i < lenpolygon - 2:
        polyinset.append(getinsetpoint(poly[i], 
                     poly[i + 1], poly[i + 2]))
        i += 1
    polyinset.append(getinsetpoint(poly[-2], 
                 poly[0], poly[1]))
    polyinset.append(getinsetpoint(poly[0], 
                 poly[1], poly[2]))

    return polyinset





# writetofile.py





"""
Write vector points to file.

Show in gnuplot.
"""

import vecipy as vecx
import os

# We're using gnuplot.
# It doesn't like commas, so
# we'll use whitespace (6).
FMT = '{0:30.28f}      {1:30.28f}'
FILEX = 'points'
ORIGSHAPE = 'originalshape'

PLOTCMD = 'set xrange[0.0:6.0]\n'
PLOTCMD += 'set yrange[0.0:6.0]\n'
PLOTCMD += 'plot "{0:s}" with lines lt rgb "red" lw 4, '
PLOTCMD += '"{1:s}" with lines lt rgb "blue" lw 4'
GNUPLOTFILE = 'plotfile'
GNUPLOT = 'gnuplot -p {:s}'.format(GNUPLOTFILE)

pts = vecx.generatepoints()
f = open(FILEX, 'w')
i = 1
for ptx in pts:
    print('Printing point {0:d} . . .'.format(i))
    print >> f, FMT.format(ptx.X, ptx.Y)
    i += 1
f.close()

# Plot original as well.
i = 0
f = open(ORIGSHAPE, 'w')
for ptx in vecx.pic.MONASTERY:
    print('Printing point {0:d} of original shape . . .'.format(i))
    print >> f, FMT.format(*ptx)
    i += 1
f.close()

f = open(GNUPLOTFILE, 'w')
print >> f, PLOTCMD.format(ORIGSHAPE, FILEX)
f.close()
os.system(GNUPLOT)
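
For readers without IronPython handy, the corner calculation in getinsetpoint can be sketched in plain Python with two-tuples. This is only an illustration of the same arithmetic, not the dot Net code: normalize and inset_point are names made up here, and a perfectly straight "corner" (where the two unit vectors cancel) is not handled. The value alpha works out to the sine of half the corner angle, which is why the distance along the bisector is OFFSET/alpha.

```python
import math

OFFSET = 0.15  # same offset the post uses


def normalize(v):
    """Return v scaled to unit length (v is a two-tuple)."""
    length = math.hypot(v[0], v[1])
    return (v[0] / length, v[1] / length)


def inset_point(pt1, pt2, pt3, offset=OFFSET):
    """
    Plain-tuple version of the corner calculation: returns the point
    offset to the right of the path pt1-pt2-pt3, at the corner pt2.
    """
    v1 = normalize((pt1[0] - pt2[0], pt1[1] - pt2[1]))
    v2 = normalize((pt3[0] - pt2[0], pt3[1] - pt2[1]))
    # 2D cross product (a scalar); its sign gives the turn direction.
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    # Unit vector along the bisector of the corner.
    v3 = normalize((v1[0] + v2[0], v1[1] + v2[1]))
    # Component of the bisector perpendicular to v2; its length is
    # the sine of half the corner angle.
    cs = v3[0] * v2[0] + v3[1] * v2[1]
    a2 = (v3[0] - cs * v2[0], v3[1] - cs * v2[1])
    alpha = math.hypot(a2[0], a2[1])
    if cs <= 0:
        alpha = -alpha
    scale = offset / alpha
    if cross < 0.0:
        scale = -scale
    return (pt2[0] + scale * v3[0], pt2[1] + scale * v3[1])


# Right-angle corner: down the Y axis, then out along the X axis.
corner = inset_point((0.0, 1.0), (0.0, 0.0), (1.0, 0.0))
# Same corner walked in the opposite direction.
reverse = inset_point((1.0, 0.0), (0.0, 0.0), (0.0, 1.0))
print(corner, reverse)
```

For a right angle both edges end up OFFSET away from the inset point, which is the property the correction in the previous post was after.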




The result (shown in previous post):









I run OpenBSD on my laptop at home.  So I would be using mono in my cross-platform experiment. 



Microsoft just recently (Fall 2014) announced the open sourcing of the dotNet Framework and cross platform capability for it.  The mono project responded very positively to this announcement.  I would imagine this as being good news for IronPython too.



OpenBSD has a package for mono.  From there, I just needed to download the IronPython binaries and run mono against them, or so I thought . . .



As it turns out, my script kept crashing on the overloaded Vector.Multiply method - NotImplementedError.  I tried to research things, wasn't having any luck, and brute forced the problem by wrapping the method in a C# class I called vecx:



Note (26NOV2014):  I hacked this C# module up a bit too quickly and didn't have performance or elegance in mind.  If you declare those Multiply methods as static you can save yourself the trouble of creating a new instance of the class each time you want to call them.  In fact, you can do the same thing with all the Vector methods you want to use (Add, CrossProduct, etc.).  I was just too hurried and too lazy.  CBT 



using System;

public class vecx
{

  public System.Windows.Vector vectorx;

  public vecx()
  {
    System.Windows.Vector vectorx = new System.Windows.Vector(0.0, 0.0);
    this.vectorx = vectorx;
  }

  public vecx(double x, double y)
  {
    System.Windows.Vector vectorx = new System.Windows.Vector(x, y);
    this.vectorx = vectorx;
  }

  public Double Multiply(System.Windows.Vector a, System.Windows.Vector b)
  {
    return System.Windows.Vector.Multiply(a, b);
  }

  public System.Windows.Vector Multiply(Double a, System.Windows.Vector b)
  {
    return System.Windows.Vector.Multiply(a, b);
  }

  public System.Windows.Vector Multiply(System.Windows.Vector a, Double b)
  {
    return System.Windows.Vector.Multiply(a, b);
  }

}







The command line (your paths will probably be different) text for compiling this under mono was:



$ mcs -r:/usr/local/lib/mono/4.5/WindowsBase.dll -target:library vecx.cs 



The code using this faux Vector class was a little bit different (and hackish):



"""
Polygon offset problem using
dot Net Framework.

Modified for use with mono.
"""

import clr

# Hacked C# module.
VECX = '/home/carl/vectormath/IronPython/mono/vecx.dll'

clr.AddReference(VECX)

import vecx

import math
import copy

import monastery as pic

OFFSET = 0.15

def scaleadd(origin, offset, vectorx):
    """
    From a Vector representing the origin,
    a scalar offset, and a Vector, returns
    a Vector object representing a point 
    offset from the origin.

    (Multiply vectorx by offset and add to origin.)
    """
    # Generic vector for use of Vector type.
    vecgeneric = vecx().vectorx

    # Multiply method that takes scalar and Vector.
    # Using cs module compiled to dll for Multiply
    # methods in mono.
    multx = vecx().Multiply(vectorx, offset)
    return vecgeneric.Add(multx, origin)

def getinsetpoint(pt1, pt2, pt3):
    """
    Given three points that form a corner (pt1, pt2, pt3),
    returns a point offset distance OFFSET to the right
    of the path formed by pt1-pt2-pt3.
    
    pt1, pt2, and pt3 are two-tuples.
    
    Returns a Vector object.
    """
    # Generic vector for use of type.
    vecgeneric = vecx().vectorx

    origin = vecx(*pt2).vectorx
    v1 = vecx(pt1[0] - pt2[0], pt1[1] - pt2[1]).vectorx
    v1.Normalize()
    
    v2 = vecx(pt3[0] - pt2[0], pt3[1] - pt2[1]).vectorx
    v2.Normalize()
    
    v3 = copy.copy(v1)

    v1 = vecgeneric.CrossProduct(v1, v2)

    v3 = vecgeneric.Add(v3, v2)
    v3.Normalize()
    
    # In dotNet - Vector.Multiply is overloaded.
    # When it gets two Vector objects as arguments
    # it returns a dot product.
    # Using cs module compiled to dll for Multiply
    # methods in mono.
    cs = vecx().Multiply(v3, v2)
    
    # Again multiplication is overloaded.
    # Here it gets a scalar and a Vector
    # as arguments.
    # Using cs module compiled to dll for Multiply
    # methods in mono.
    a1 = vecx().Multiply(cs, v2)
    a2 = vecgeneric.Subtract(v3, a1)
    
    if cs > 0:
        alpha = math.sqrt(a2.LengthSquared)
    else:
        alpha = -math.sqrt(a2.LengthSquared)
    
    if v1 < 0.0:
        return scaleadd(origin, -1.0 * OFFSET/alpha, v3)
    else:
        return scaleadd(origin, OFFSET/alpha, v3)

def generatepoints():
    """
    Create list of offset points
    for points inset from polygon.

    Return list.
    """
    polyinset = []
    lenpolygon = len(pic.MONASTERY)
    i = 0
    poly = pic.MONASTERY
    while i < lenpolygon - 2:
        polyinset.append(getinsetpoint(poly[i], 
                     poly[i + 1], poly[i + 2]))
        i += 1
    polyinset.append(getinsetpoint(poly[-2], 
                 poly[0], poly[1]))
    polyinset.append(getinsetpoint(poly[0], 
                 poly[1], poly[2]))

    return polyinset



Any port in a storm or whatever it takes, as they say.



Thanks again to Mr. Rafsanjani whom I referenced in my previous post.  His methodology and detection of a former bug got me back on track.



And thank you for stopping by.







Polygon Offset With pyeuclid Revisited
2014-11-16T00:09:00.001-08:00
A few years back I did two or three posts on polygon offset.  It was a learning experience that I never quite completed to my satisfaction.  A kind visitor to my last post on the subject, Mr. Ahmad Rafsanjani, actually rewrote some of my code in a comment.  I gave him a polite weasel answer thanking him, but dropped the effort and never felt quite right about it.



Well, as the saying goes, better late than never.  He was quite correct in his assessment, but my understanding of vector math was not strong enough to prove this to myself.  I was visually inspecting the results, and, given what I was dealing with at the time, they seemed OK. 



Here is the picture we're trying to get (this is with Mr. Rafsanjani's code; my original code is wrong, but the visible difference between the two is not great):















In order to nail down the discrepancy in my original code, I inserted some print statements with a lot of numeric precision (28 digits to the right of the decimal) in the output: 



$ more points
1.2231671842700024832595318003      1.7024195134850139687898717966
2.1231671842700023944416898303      1.7024195134850139687898717966
2.2768328157299975167404681997      2.5475804865149860312101282034
1.6635803619063778135966913396      2.5493839809701555054743948858
1.7364196380936223196300716154      3.3506160190298444057077631442
2.5205825797292722434406186949      3.3529128986463621053815131745
2.6794174202707274901058553951      4.1470871013536383387076966756
2.1360193516544989655869812850      4.1228847880778562995374159073



(etc.)



The numbers highlighted in yellow are mismatches in the Y-coordinates of points of the inset offset polygon - each pair of Y coordinates should represent lines parallel to the X axis; in other words, they should be equal.  I have a bug.



Contrast that with the numbers yielded by Mr. Rafsanjani's code:



$ more points
1.2251864530113494300422871675      1.7000000000000001776356839400
2.1251864530113491191798402724      1.7000000000000001776356839400
2.2797319075568038826418160170      2.5499999999999998223643160600
1.6642549229616445671808833140      2.5499999999999998223643160600
1.7369821956889173186766583967      3.3500000000000000888178419700
2.5229705854077835169846366625      3.3500000000000000888178419700
2.6829705854077836590931838145      4.1500000000000003552713678801
2.1880983342360056376207921858      4.1500000000000003552713678801
2.6780983342360054066944030637      4.8499999999999996447286321199
3.1219016657639944156699129962      4.8499999999999996447286321199

(etc.)



Much better.  Lines that are supposed to be perfectly parallel to the X axis are, at least to 28 decimal places precision and the limits of my platform and the C Python interpreter, parallel to the X axis.  For what I am doing, I can more than live with that.
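
The eyeball comparison above can also be done in a few lines of Python. This is a minimal sketch: mismatched_pairs is a name made up for the illustration, the coordinates are the first eight points from each listing rounded to ten decimal places, and treating the points as consecutive pairs (0-1, 2-3, . . .) is an assumption that happens to fit this stretch of the shape, not a general rule.

```python
def mismatched_pairs(points, tol=1e-9):
    """
    Treat points as consecutive pairs (0-1, 2-3, ...) and return the
    index of the first point of every pair whose Y values differ by
    more than tol.
    """
    bad = []
    for i in range(0, len(points) - 1, 2):
        if abs(points[i][1] - points[i + 1][1]) > tol:
            bad.append(i)
    return bad


# First eight points of my original output (rounded).
BUGGY = [(1.2231671843, 1.7024195135), (2.1231671843, 1.7024195135),
         (2.2768328157, 2.5475804865), (1.6635803619, 2.5493839810),
         (1.7364196381, 3.3506160190), (2.5205825797, 3.3529128986),
         (2.6794174203, 4.1470871014), (2.1360193517, 4.1228847881)]

# First eight points from Mr. Rafsanjani's code (rounded).
FIXED = [(1.2251864530, 1.7), (2.1251864530, 1.7),
         (2.2797319076, 2.55), (1.6642549230, 2.55),
         (1.7369821957, 3.35), (2.5229705854, 3.35),
         (2.6829705854, 4.15), (2.1880983342, 4.15)]

buggy_result = mismatched_pairs(BUGGY)
fixed_result = mismatched_pairs(FIXED)
print(buggy_result, fixed_result)
```

The tolerance is arbitrary; anything well below the size of the discrepancies (roughly 0.002 and up) separates the two outputs.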



I've included Mr. Rafsanjani's comments in the code.  My modifications to his code were mainly for the purpose of printing some things out and organizing the polygon offset part of this exercise into a module.

I've made a separate main script for gnuplot.  After not looking at everything for three years I realized I had forgotten everything I ever knew about gnuplot and wanted to record it this time.  The file with the 20 points for the shape (monastery.py) is available on request.



Here is the main pyeuclid/polygon offset part of the code (rafsanjanicorrection.py):



"""
Polygon offset problem using
pyeuclid and incorporating corrections
made by Ahmad Rafsanjani.
"""

# Mr. Rafsanjani's comments:

# I think there is a small bug:

# In "getinsetpoint", the vector v3 should be
# normalized before passing to "scaleadd".

# Furthermore, the final offset is not as the
# prescribed OFFSET and the angle between 
# vectors should be taken into account.

# A possible solution could be:

import euclid as eu
import math
import copy

import monastery as pic

OFFSET = 0.15

def scaleadd(origin, offset, vectorx):
    """
    From a vector representing the origin,
    a scalar offset, and a vector, returns
    a Vector3 object representing a point 
    offset from the origin.

    (Multiply vectorx by offset and add to origin.)
    """
    multx = vectorx * offset
    return multx + origin

def getinsetpoint(pt1, pt2, pt3):
    """
    Given three points that form a corner (pt1, pt2, pt3),
    returns a point offset distance OFFSET to the right
    of the path formed by pt1-pt2-pt3.
    
    pt1, pt2, and pt3 are two-tuples.
    
    Returns a Vector3 object.
    """
    origin = eu.Vector3(pt2[0], pt2[1], 0.0)
    v1 = eu.Vector3(pt1[0] - pt2[0], pt1[1] - pt2[1], 0.0)
    v1.normalize()
    
    v2 = eu.Vector3(pt3[0] - pt2[0], pt3[1] - pt2[1], 0.0)
    v2.normalize()
    
    v3 = copy.copy(v1)
    v1 = v1.cross(v2)
    v3 += v2
    v3.normalize()
    
    cs = v3.dot(v2)
    
    a1 = cs * v2
    a2 = v3 - a1
    
    if cs > 0:
        alpha = math.sqrt(a2.magnitude_squared())
    else:
        alpha = -math.sqrt(a2.magnitude_squared())
    
    if v1.z < 0.0:
        return scaleadd(origin, -1.0 * OFFSET/alpha, v3)
    else:
        return scaleadd(origin, OFFSET/alpha, v3)

def generatepoints():
    """
    Create list of offset points
    (pyeuclid.Vector3 objects) for
    points inset from polygon.

    Return list.
    """
    polyinset = []
    lenpolygon = len(pic.MONASTERY)
    i = 0
    poly = pic.MONASTERY
    while i < lenpolygon - 2:
        polyinset.append(getinsetpoint(poly[i], 
                     poly[i + 1], poly[i + 2]))
        i += 1
    polyinset.append(getinsetpoint(poly[-2], 
                 poly[0], poly[1]))
    polyinset.append(getinsetpoint(poly[0], 
                 poly[1], poly[2]))

    return polyinset



The file that prints stuff out and summons gnuplot (writetofile.py):



"""
Write vector points to file.

Show in gnuplot.
"""

# import blogpost as vecx
import rafsanjanicorrection as vecx
import os

# We're using gnuplot.
# It doesn't like commas, so
# we'll use whitespace (6).
FMT = '{0:30.28f}      {1:30.28f}'
FILEX = 'points'
ORIGSHAPE = 'originalshape'

PLOTCMD = 'set xrange[0.0:6.0]\n'
PLOTCMD += 'set yrange[0.0:6.0]\n'
PLOTCMD += 'plot "{0:s}" with lines lt rgb "red" lw 4, '
PLOTCMD += '"{1:s}" with lines lt rgb "blue" lw 4'
GNUPLOTFILE = 'plotfile'
GNUPLOT = 'gnuplot -p {:s}'.format(GNUPLOTFILE)

pts = vecx.generatepoints()
f = open(FILEX, 'w')
i = 1
for ptx in pts:
    print('Printing point {0:d} . . .'.format(i))
    print >> f, FMT.format(ptx.x, ptx.y)
    i += 1
f.close()
# Plot original as well.
# XXX - repetitive - make function.
i = 0
f = open(ORIGSHAPE, 'w')
for ptx in vecx.pic.MONASTERY:
    print('Printing point {0:d} of original shape . . .'.format(i))
    print >> f, FMT.format(ptx[0], ptx[1])
    i += 1
f.close()

f = open(GNUPLOTFILE, 'w')
print >> f, PLOTCMD.format(ORIGSHAPE, FILEX)
f.close()
os.system(GNUPLOT) 
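
The XXX note in the script about the repeated write loops could be addressed with a small helper along these lines. This is a sketch only: writepoints is a hypothetical name, it takes plain (x, y) pairs (so pyeuclid vectors would be passed in as (p.x, p.y)), and it uses f.write rather than the Python 2 print >> f form so it runs on either Python version. The temporary directory is there just to keep the demonstration from writing into the working directory.

```python
import os
import tempfile

# Same whitespace-separated format the script uses for gnuplot.
FMT = '{0:30.28f}      {1:30.28f}'


def writepoints(path, points):
    """
    Write (x, y) pairs to path, one pair per line, and
    return the number of points written.
    """
    with open(path, 'w') as f:
        for x, y in points:
            f.write(FMT.format(x, y) + '\n')
    return len(points)


# Demonstration with two made-up points.
tmpdir = tempfile.mkdtemp()
demo = os.path.join(tmpdir, 'points')
count = writepoints(demo, [(1.0, 1.7), (2.0, 1.7)])
with open(demo) as f:
    lines = f.read().splitlines()
print(count, lines)
```

With that in place, the inset polygon and the original shape would each be one call.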



pyeuclid, to the best of my knowledge, runs only in Python 2.7 at the moment.  In any case, I got an error on the Python 3.4 install with setup.py so I stuck with 2.7.



Thanks to Mr. Rafsanjani for his help with this and to the rest of you for stopping by.





















MeetBSD California 2014 Recap
2014-11-03T10:56:00.000-08:00






I am returning from MeetBSD in San Jose, California.  This isn't a Python-related post per se, but the BSD family of operating systems maintains packages and ports for Python and Python third party libraries, and use of Python on these systems is significant both in the open source development and commercial spheres.



The structure of the conference is a brief weekend unconference.  Nonetheless some of the talks were more than worthy of a full fledged mega-con, and the rest were quality.  It was a good deal.



Venue:  the conference was held at Western Digital.   WD sells a variety of hardware.  The product they were pushing was a several terabyte little box that updates wirelessly (but not by Bluetooth). 



We met in a rectangular conference room.  All of Silicon Valley seems to me to be an endless office park with nice weather and some landscaped spots (I've included the obligatory Strelitzia/bird of paradise pic from the conference hotel entrance below).  It was a fairly intimate setting.  The food (a variety of sandwiches) was good.  We were warned ahead of time that Wifi was limited; I brought my own Verizon jetpack unit so it wasn't an issue for me.







Talks (that I attended): 



1) Rick Reed, “WhatsApp: Half a billion unsuspecting FreeBSD users” - Erlang and FreeBSD at WhatsApp used for scaling.  Now 600 million users.  It was a good talk, but I wasn't awake and some of it went over my head.



2) Jordan Hubbard, “FreeBSD: The Next 10 Years” Good talk; I hated it :-(



Hubbard's leaving Apple a couple years ago and signing on with iXSystems (a sponsor and essentially the organizer of this conference) made a big splash.  He is an accomplished dev and a good guy by all accounts.  His ideas are, on many levels, entirely valid.



I am primarily an OpenBSD user.  I run FreeBSD on my RPi and on a spare laptop for easy access to Java.  The two OS's have similar philosophies in some respects (correctness, BSD license, etc.).  There is cross-pollination when it comes to operating system components, apps, and drivers.  But where OpenBSD unapologetically maintains new releases for older hardware and uncompromisingly adheres to its leader's approach to security and development, FreeBSD in the framework of Hubbard's talk is looking more towards the future and making changes to attract younger talented core committers and target more modern (read mobile) platforms.  Telemetry, scrapping development on older platforms "ruthlessly," getting younger devs involved by providing work that's interesting to them - all this stuff is important for FreeBSD going forward.  At one point he even <gasp> suggested systemd as a good strategy for Linux that FreeBSD should, at least in principle if not in form, emulate.



FreeBSD is everywhere - or at least in a lot of places companies just don't make a big deal of.  Inside cable (connections) was the one example.  In order to accommodate mobile and embedded environments, the OS, although well suited to these platforms now, needs to change.



A lot of this in my mind goes against OpenBSD's philosophy - purity and security at all costs.  My personal philosophy lies with the OpenBSD approach, but I may well be wrong.  Hubbard is a guy with a lot of industry know how and experience and I am a geologist who uses OpenBSD.  He is probably right, but I don't want my fun to stop, so I'm sticking with OpenBSD even if death awaits us . . .



3) David Maxwell, "The Unix command pipeline - using Unix in the renewable energy era"







I always liked Maxwell.  He's a Canadian guy and a NetBSD devotee.



His talk was about a command line app he's putting together for better tracking piped commands on the UNIX command line and reproducing, referencing, and inspecting them retroactively in a way that's easier than what you have to do now.  I think it's got potential and would like to see it succeed.



After the angst I felt over Hubbard's talk, this was a welcome relief.  The UNIX command line is something everyone, or most everyone at the con knows and loves.  Everyone uses piped commands.  This is a useful approach to a common problem - that's something we can all agree on.  My favorite talk of the conference (that I attended).





4) Alex Rosenberg, "Meet PlayStation 4"













By far and away the coolest talk.  Rosenberg presented this well and spoke honestly and as openly as he could as a member of a big commercial project about specifics.  Games require so much optimization at such a low level.  Although this theme came up in a number of the talks, on the PlayStation project it's critical.  Essentially, the best hardware and hardware architecture for the project is selected for a given product lifecycle (10 years?  IIRC) then you hammer at it with software modifications to get every last bit of efficiency out of it.



It's not like there's a standard laptop install of FreeBSD on PlayStation 4 and you let it rip with your happy traditional UNIX OS. They're optimizing LLVM and clang (the compiler and linkers), talking directly to the metal as much as possible, and just generally nailing performance at the lowest level of the architecture (after they've gotten the low hanging fruit up top, of course).



Another theme that came up in almost all the talks, but especially in this one, was the BSD license.  Granted, it was a BSD conference, so organizers and attendees have a bias.  Nonetheless, it appears that licensing is really critical in the decision to adopt open source software and operating systems.  "Business friendly" nowadays often has "capitalism at its worst" overtones.  Still, it was a theme:  the BSD license is the "business friendly" one whereas the GPL, particularly the GPL3, is not . . . 



I'm not a gamer, but I enjoyed this.  Rosenberg is really easy to talk to as well.  He let me take that pic up close when we were posing for the group pic after his talk.



5) Brendan Gregg, "Performance Analysis"











Gregg works for Netflix.  He's written a lot of dtrace scripts (including numerous Python ones) and has them readily available on Github.



I found myself wishing I knew more about the subject, because performance monitoring is a really cool netadmin problem when, like Netflix, you're dealing with huge bandwidth challenges (as in other talks, so much comes down to optimization).



That said, Gregg presented some graphical tools that are useful (I'll get the names wrong, so I won't try) - basically histogram-like, color coded performance charts with labels for processes.  You don't have to run your own Netflix to benefit from these and he's made everything open source and available.  If I were a netadmin I would jump on this.  I've got to get smarter first before I can benefit from these tools.



Gregg has a soft British accent and a very amiable demeanor.  He was the first talk in the morning.  It was like a lullaby.  This is one I need to revisit on the videos posted online because it's worth it.





6) Corey Vixie, "Web Apps on Embedded BSD..."













The iXSystems surprise talk, but a good one.  The youngster Vixie briefed us a bit on what iXSystems is doing with web presentation layer (for lack of a better description) of the FreeNAS implementation.



He started off by saying static web pages are, at least for apps like FreeNAS, not the way to go anymore.  Refreshing the DOM (Document Object Model) at regular intervals is not going to work well.  He then introduced us to a number of mature and nascent JavaScript/web technologies, some of which no one in the room had yet heard of.  Basically he had to rewrite the "old" Django/other technologies implementation to accommodate better simulation of a desktop app in the browser.



The specifics were not something I could follow well because of my ignorance.  There was talk of an Open Source, BSD licensed Facebook framework whose name I can't recall, a one-way change propagation architecture for updating the dynamic web page, and, as always, optimization of the process.  I asked him about Django after the talk.  He said it was the best thing a couple years ago for this app, but now they needed something that could interact directly with the browser - namely JavaScript - it comes down to fine-grained control and optimization.










One humorous interlude during the Q & A was my asking him if he was indeed related to Paul Vixie, historical UNIX tools author (Vixie Cron), to which he replied, "This is the part of my talk where I say, 'I am Worf, son of Mogh.'"  Anyone with a sense of humor and a knowledge of STTNG can't be all bad ;-)



A few people pics:













Dru Lavigne.  Without the BSDA cert program she helped found, I would never have gotten over the hump learning UNIX.  We differ on our choice of specific BSD, but I still consider her my UNIX mentor.













iXSystems old timers Denise and Matt working out conference specifics.













FreeBSD Foundation rep Anne.



Conclusion:  MeetBSD is an affordable, pretty meaty con if you like UNIX, hardware, and topics about optimization and scale.  It is, fortunately or unfortunately, a pretty well kept secret.



Thanks for stopping by.


Gtk.TreeView (grid view) with mono, gtk-sharp, and IronPython
2014-10-31T00:15:00.000-07:00
The post immediately prior to this one was an attempt to reproduce Windows.Forms Calendar controls in Gtk for cross platform (Windows/*nix) effective rendering.



This time I am attempting to get familiar with gtk-sharp/Gtk's version of a grid view - the Gtk.TreeView object.  Some of the gtk-sharp documentation suggests the NodeView object would be easier to use.  I had some trouble instantiating the objects associated with the NodeView and went with the TreeView instead in the hopes of getting more control.



The Windows.Forms GridView I did years ago is here.  It became apparent to me shortly after embarking on this journey that I would be hard pressed to recreate all the functionality of that script in a timely manner.  I settled for a tabular view of drillhole data (fabricated, mock data) with some custom formatting.



Aside:  this is typically how mineral exploration drillhole data (core, reverse circulation drilling) is presented in tabular format - a series of from-to intervals with assay values.  Assuming the assays are all separate elements, the reported weight percents should not sum to more than 100%, and never do unless someone fat fingers a decimal place.  I've projected a couple screaming hot polymetallic drill holes that end near surface (lack of funding for drilling), but show enough promise that the new mining town of Trachteville (the drill hole name CBT-BNZA stands for CBT-Bonanza) will spring up there at any moment . . . one can dream.



The data store object for the grid view, Gtk.ListStore, would not instantiate in IronPython.  I was not the only person to have experienced this problem (I cannot locate the link to the mailing list thread or forum reference, but like the big fish that got away, I swear I saw it).  I didn't want to drop the effort just because of that, so I hacked and compiled some C# code:



public class storex
{
    public Gtk.ListStore drillhole = 
                            // 7 columns
                            // drillhole id 
          new Gtk.ListStore (typeof (string),
                            // from
                            typeof (double),
                            // to
                            typeof (double),
                            // assay1
                            typeof (double),
                            // assay2
                            typeof (double),
                            // assay3
                            typeof (double),
                            // assay4
                            typeof (double));
}



The mono command on Windows was



C:\UserPrograms\Mono-3.2.3>mcs -pkg:gtk-sharp-2.0 /target:library C:\UserPrograms\IronPythonGUI\storex.cs 



Those are my file paths; locations depend on where you install things like mono and IronPython.



Anyway, I got my dll and I was off to the races.  Getting to know the Gtk and gtk-sharp object model proved challenging for me.  I'm glad I got some familiarity with it, but it would take me longer to do something in Gtk than it did with Windows.Forms.  The most fun and gratifying part of the project was getting the custom formatting to work with a Gtk.TreeCellDataFunc.  I used a function that yielded specific functions for each column - something that's really easy to do in Python.
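
The function-that-yields-functions idea can be shown without any Gtk at all. In this minimal sketch, make_formatter is a made-up stand-in for the script's genericfloatformat: instead of setting properties on a cell renderer, each generated function just returns the formatted text and a color name for one column of a plain tuple. The row values and format strings are borrowed from the script's mock data; everything else is an assumption for illustration.

```python
BLAZINGCUTOFF = 10.0  # same cutoff as the script


def make_formatter(floatfmt, index, cutoff=BLAZINGCUTOFF):
    """
    Return a function that formats column `index` of a row.

    The returned function remembers floatfmt, index, and cutoff
    (a closure), so each column gets its own formatter.
    """
    def fmt(row):
        value = row[index]
        # Only the assay columns (index > 2) get the cutoff coloring.
        if index > 2 and value > cutoff:
            color = 'red'
        else:
            color = 'black'
        return floatfmt.format(value), color
    return fmt


# drillhole, from, to, assay1 .. assay4
row = ('DH_CBTBNZA-01', 0.0, 8.7, 22.27, 4.93, 18.75, 35.18)

formatters = [make_formatter('{:>5.1f}', i) for i in (1, 2)]
formatters += [make_formatter('{:>4.2f}', i) for i in (3, 4, 5, 6)]

cells = [f(row) for f in formatters]
print(cells)
```

In the real script the generated function is wrapped in a Gtk.TreeCellDataFunc and sets the renderer's Text and Foreground properties instead of returning a tuple.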



Anyway, here are a couple screenshots and the IronPython code:















The OpenBSD one below turned out pretty good, but the Windows one had a little double line underneath the first row - it looked as though it was still trying to select that row when I told it specifically not to.  I'm not a design perfectionist Steve Jobs type, but niggling nits like that drive me batty.  For now, though, it's best I publish the code and move on.



#!/usr/local/bin/mono /home/carl/IronPython-2.7.4/ipy64.exe

import clr

GTKSHARP = 'gtk-sharp'
PANGO = 'pango-sharp'

# Mock store C#
STOREX = 'storex'

clr.AddReference(GTKSHARP)
clr.AddReference(PANGO)

# C# module compiled for this project.
# Problems with Gtk.ListStore in IronPython.
clr.AddReference(STOREX)

import Gtk
import Pango

import storex

TITLE = 'Gtk.TreeView Demo (Drillholes)'
MARKUP = '<span font="Courier New" size="14" weight="bold">{:s}</span>'
MARKEDUPTITLE = MARKUP.format(TITLE)

CENTERED = 0.5
RIGHT = 1.0

WINDOWWIDTH = 350

COURFONTREGULAR = 'Courier New 12'
COURFONTBOLD = 'Courier New Bold 12'

DHNAME = 'DH_CBTBNZA-{:>02d}'
DHNAMELABEL = 'drillhole'
FROM = 'from'
TO = 'to'
ASSAY1 = 'assay1'
ASSAY2 = 'assay2'
ASSAY3 = 'assay3'
ASSAY4 = 'assay4'

FP1FMT = '{:>5.1f}'
FP2FMT = '{:>4.2f}'

DHDATAX = {(DHNAME.format(1), 0.0):{TO:8.7,
                                    ASSAY1:22.27,
                                    ASSAY2:4.93,
                                    ASSAY3:18.75,
                                    ASSAY4:35.18},
           (DHNAME.format(1), 8.7):{TO:15.3,
                                    ASSAY1:0.27,
                                    ASSAY2:0.09,
                                    ASSAY3:0.03,
                                    ASSAY4:0.22},
           (DHNAME.format(1), 15.3):{TO:25.3,
                                     ASSAY1:2.56,
                                     ASSAY2:11.34,
                                     ASSAY3:0.19,
                                     ASSAY4:13.46},
           (DHNAME.format(2), 0.0):{TO:10.0,
                                    ASSAY1:0.07,
                                    ASSAY2:1.23,
                                    ASSAY3:4.78,
                                    ASSAY4:5.13},
           (DHNAME.format(2), 10.0):{TO:20.0,
                                     ASSAY1:44.88,
                                     ASSAY2:12.97,
                                     ASSAY3:0.19,
                                     ASSAY4:0.03}}

FIELDS = [DHNAMELABEL, FROM, TO, ASSAY1, ASSAY2, ASSAY3, ASSAY4]
BOLDEDCOLUMNS = [DHNAMELABEL, FROM, TO] 
NONKEYFIELDS = FIELDS[2:]

BLAZINGCUTOFF = 10.0

def genericfloatformat(floatfmt, index):
    """
    For cell formatting in Gtk.TreeView.

    Returns a function to format floats
    and to format floats' foreground color
    based on cutoff value.

    floatfmt is a format string.

    index is an int that indicates the
    column being formatted.
    """
    def setfloatfmt(treeviewcolumn, cellrenderer, treemodel, treeiter):
        cellrenderer.Text = floatfmt.format(treemodel.GetValue(treeiter, index))
        # If it is one of the assay value columns.
        # XXX - not generic.
        if index > 2:
            if treemodel.GetValue(treeiter, index) > BLAZINGCUTOFF:
                cellrenderer.Foreground = 'red'
            else:
                cellrenderer.Foreground = 'black'
    return Gtk.TreeCellDataFunc(setfloatfmt)

class TreeViewTest(object):
    def __init__(self):
        Gtk.Application.Init()
        self.window = Gtk.Window('')
        # DeleteEvent - copied from Gtk demo on internet.
        self.window.DeleteEvent += self.DeleteEvent
        # Frame property provides a frame and title.
        self.frame = Gtk.Frame(MARKEDUPTITLE)
        self.tree = Gtk.TreeView()
        self.tree.EnableGridLines = Gtk.TreeViewGridLines.Both
        self.frame.Add(self.tree)

        # Fonts for formatting.
        self.fdregular = Pango.FontDescription.FromString(COURFONTREGULAR)
        self.fdbold = Pango.FontDescription.FromString(COURFONTBOLD)

        # C# module
        self.store = storex().drillhole

        self.makecolumns()
        self.adddata()
        self.tree.Model = self.store

        self.formatcolumns()
        self.formatcells()
        self.prettyup()

        self.window.Add(self.frame)
        self.window.ShowAll()
        # Keep text viewable - size no smaller than intended.
        self.window.AllowShrink = False
        # XXX - hack to keep lack of gridlines on edges of
        #       table from showing.
        self.window.AllowGrow = False
        # Unselect everything for this demo.
        self.tree.Selection.UnselectAll()
        Gtk.Application.Run()

    def makecolumns(self):
        """
        Fill in columns for TreeView.
        """
        self.columns = {}
        for fieldx in FIELDS:
            self.columns[fieldx] = Gtk.TreeViewColumn()
            self.columns[fieldx].Title = fieldx
            self.tree.AppendColumn(self.columns[fieldx])

    def formatcolumns(self):
        """
        Make custom labels for column headers.

        Get each column properly justified (all
        are right justified, floating point numbers
        except for the drillhole 'number' -
        actually a string).
        """
        self.customlabels = {}

        for fieldx in FIELDS:
            # This centers the labels at the top.
            self.columns[fieldx].Alignment = CENTERED
            self.customlabels[fieldx] = Gtk.Label(self.columns[fieldx].Title)
            self.customlabels[fieldx].ModifyFont(self.fdbold)
            # 120 is about right for from, to, and assay columns.
            self.columns[fieldx].MinWidth = 120
            self.customlabels[fieldx].ShowAll()
            self.columns[fieldx].Widget = self.customlabels[fieldx]
            # ShowAll required for new label to take.
            self.columns[fieldx].Widget.ShowAll()

    def formatcells(self):
        """
        Add and format cell renderers.
        """
        self.cellrenderers = {}

        for fieldx in FIELDS:
            self.cellrenderers[fieldx] = Gtk.CellRendererText()
            self.columns[fieldx].PackStart(self.cellrenderers[fieldx], True)
            # Drillhole 'number' (string)
            if fieldx == FIELDS[0]:
                self.cellrenderers[fieldx].Xalign = CENTERED
                self.columns[fieldx].AddAttribute(self.cellrenderers[fieldx], 
                        'text', 0)
            else:
                self.cellrenderers[fieldx].Xalign = RIGHT
                try:
                    self.columns[fieldx].AddAttribute(self.cellrenderers[fieldx], 
                            'text', FIELDS.index(fieldx))
                except ValueError:
                    print('\n\nProblem with field definitions; field not found.\n\n')
        for fieldx in BOLDEDCOLUMNS:
            self.cellrenderers[fieldx].Font = COURFONTBOLD
        self.columns[fieldx].Widget.ShowAll()

        # XXX - not very generic, but better than doing them one by one.
        # from, to columns.
        for x in xrange(1, 3):
            self.columns[FIELDS[x]].SetCellDataFunc(self.cellrenderers[FIELDS[x]],
                    genericfloatformat(FP1FMT, x))
        # assay<x> columns.
        for x in xrange(3, 7):
            self.columns[FIELDS[x]].SetCellDataFunc(self.cellrenderers[FIELDS[x]],
                    genericfloatformat(FP2FMT, x))

    def usemarkup(self):
        """
        Refreshes UseMarkup property on widgets (labels)
        so that they display properly and without 
        markup text.
        """
        # Have to refresh this property each time.
        self.frame.LabelWidget.UseMarkup = True

    def prettyup(self):
        """
        Get Gtk objects looking the way we
        intended.
        """
        # Try to get Courier New on treeview.
        self.tree.ModifyFont(self.fdregular)
        # Get rid of line.
        self.frame.Shadow = Gtk.ShadowType.None
        self.usemarkup()

    def adddata(self):
        """
        Put data into store.
        """
        # XXX - difficulty figuring out sorting
        #       function for TreeView.  Hack it
        #       with dictionary here.
        keytuples = [key for key in DHDATAX]
        keytuples.sort()
        datax = []
        for tuplex in keytuples:
            # Key tuple holds drillhole name and from value.
            datax.extend(tuplex)
            for fieldx in NONKEYFIELDS:
                datax.append(DHDATAX[tuplex][fieldx])
            self.store.AppendValues(*datax)
            # Reinitialize data row list.
            datax = []

    def DeleteEvent(self, widget, event):
        Gtk.Application.Quit()

if __name__ == '__main__':
    TreeViewTest()



Thanks for stopping by. 



Mono gtk-sharp IronPython CalendarView
2014-10-30T23:25:00.000-07:00
A number of years ago I did a post on the IronPython Cookbook site about the Windows.Forms Calendar control.  I could never get the thing to render nicely on *nix operating systems (BSD family).  It sounds as though Windows.Forms development for mono (and in general) is kind of dead, so there is not much hope that solution/example will ever render nicely on *nix.  Recently I've been playing with mono and decided to give gtk-sharp a shot with IronPython.



Quick disclaimers:



1) I suspect from the examples I've seen on the internet that PyGtk is a little easier to deal with than gtk-sharp.  That's OK; I wanted to use IronPython and have the rest of the mono/dotNet framework available, so I went through the extra trouble to forego CPython and PyGtk and go with IronPython and gtk-sharp instead.



2) The desktop is not the most cutting edge or sexy platform in 2014.  Nonetheless, where I work it is alive and well.  When I no longer see engineers hacking solutions in Excel and VBA, I'll consider the possibility of outliving the desktop.  Right now I'm not hopeful :-\



The results aren't bad, at least as far as rendering goes.  I couldn't get the Courier font to take on OpenBSD, but the Gtk Calendar control looks acceptable.  All in all, I was OK with the results on both Windows and OpenBSD.  I've heard Gtk doesn't do quite as well on Apple products, but I don't own a Mac to test with.  Here are a couple screenshots:




I run the cwm window manager on OpenBSD and have it set up to cut out borders on windows, hence the more minimalist look to the control there.



IronPython output on *nix has always come out in yellow or white - it doesn't show up on a white background, which I prefer.  In order to get around this, I run an xterm with a black background:



xterm -bg black -fg white



Here is the code for the gtk-sharp Gtk.Calendar control:



#!/usr/local/bin/mono /home/carl/IronPython-2.7.4/ipy64.exe

import clr

GTKSHARP = 'gtk-sharp'
PANGO = 'pango-sharp'

clr.AddReference(GTKSHARP)
clr.AddReference(PANGO)

import Gtk
import Pango

import datetime

TITLE = 'Gtk.Calendar Demo'
MARKUP = '<span font="Courier New" size="14" weight="bold">{:s}</span>'
MARKEDUPTITLE = MARKUP.format(TITLE)

INFOMSG = '<span font="Courier New 12">\n\n Program set to run for:\n\n '
INFOMSG += '{:%Y-%m-%d}\n\n</span>'

DATEDIFFMSG = '<span font="Courier New 12">\n\n '
DATEDIFFMSG += 'There are {0:d} days between the\n'
DATEDIFFMSG += ' beginning of the epoch and\n'
DATEDIFFMSG += ' {1:%Y-%m-%d}.\n\n</span>'

ALIGNMENTPARAMS = (0.0, 0.5, 0.0, 0.0)

WINDOWWIDTH = 350

CALENDARFONT = 'Courier New Bold 12'

class CalendarTest(object):
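    # utcfromtimestamp keeps the epoch at midnight, Jan 1, 1970 GMT
    # regardless of the local time zone.
    inthebeginning = datetime.datetime.utcfromtimestamp(0)
    # Debug info - make sure beginning of epoch really
    #              is +midnight, Jan 1, 1970 GMT.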
    print(inthebeginning)
    def __init__(self):
        Gtk.Application.Init()
        self.window = Gtk.Window(TITLE)
        # DeleteEvent - copied from Gtk demo on internet.
        self.window.DeleteEvent += self.DeleteEvent
        # Frame property provides a frame and title.
        self.frame = Gtk.Frame(MARKEDUPTITLE)
        self.calendar = Gtk.Calendar()
        # Handles date selection event.
        self.calendar.DaySelected += self.dateselect
        # Sets up text for labels.
        self.getcaltext()
        # Puts little box around text.
        self.datelabelframe = Gtk.Frame()
        # Try to get datelabel to align with other label.
        self.datelabelalignment = Gtk.Alignment(*ALIGNMENTPARAMS)
        self.datelabel = Gtk.Label(self.caltext)
        self.datelabelalignment.Add(self.datelabel)
        self.datelabelframe.Add(self.datelabelalignment)
        # Puts little box around text.
        self.datedifflabelframe = Gtk.Frame()
        self.datedifflabelalignment = Gtk.Alignment(*ALIGNMENTPARAMS)
        self.datedifflabel = Gtk.Label(self.timedifftext)
        self.datedifflabelalignment.Add(self.datedifflabel)
        self.datedifflabelframe.Add(self.datedifflabelalignment)
        self.vbox = Gtk.VBox()
        self.vbox.PackStart(self.datelabelframe)
        self.vbox.PackStart(self.datedifflabelframe)
        self.vbox.PackStart(self.calendar)
        self.frame.Add(self.vbox)
        self.window.Add(self.frame)
        self.prettyup()
        self.window.ShowAll()
        # Keep text viewable - size no smaller than intended.
        self.window.AllowShrink = False
        Gtk.Application.Run()

    def getcaltext(self):
        """
        Get messages for run date.
        """
        # Calendar month is 0 based.
        yearmonthday = self.calendar.Year, self.calendar.Month + 1, self.calendar.Day
        chosendate = datetime.datetime(*yearmonthday)
        self.caltext = INFOMSG.format(chosendate)
        # For reporting of number of days since beginning of epoch.
        timediff = chosendate - CalendarTest.inthebeginning
        self.timedifftext = DATEDIFFMSG.format(timediff.days, chosendate)

    def usemarkup(self):
        """
        Refreshes UseMarkup property on widgets (labels)
        so that they display properly and without 
        markup text.
        """
        # Have to refresh this property each time.
        self.frame.LabelWidget.UseMarkup = True
        self.datelabel.UseMarkup = True
        self.datedifflabel.UseMarkup = True

    def prettyup(self):
        """
        Get Gtk objects looking the way we
        intended.
        """
        # Allows bold, etc.
        self.usemarkup()
        # Try to make frame wider.
        # XXX - works nicely on Windows - try on Unix.
        self.frame.SetSizeRequest(WINDOWWIDTH, -1)
        # Get rid of line in middle of text on title.
        self.frame.Shadow = Gtk.ShadowType.None
        # Try to get Courier New on calendar.
        fd = Pango.FontDescription.FromString(CALENDARFONT)
        self.calendar.ModifyFont(fd)
        self.datelabel.Justify = Gtk.Justification.Left
        self.datedifflabel.Justify = Gtk.Justification.Left
        self.window.Title = ''
        self.usemarkup()

    def dateselect(self, widget, event):
        self.getcaltext()
        self.datelabel.Text = self.caltext
        self.datedifflabel.Text = self.timedifftext
        self.prettyup()

    def DeleteEvent(self, widget, event):
        Gtk.Application.Quit()

if __name__ == '__main__':
    CalendarTest()



Thanks for stopping by.  





subprocess.Popen() or Abusing a Home-grown Windows Executable
2014-10-20T14:11:00.002-07:00
Each month I redo 3D block model interpolations for a series of open pits at a distant mine.  Those of you who follow my Twitter feed often see me tweet, "The 3D geologic block model interpolation chuggeth . . ."  What's going on is that I've got all the processing power maxed out dealing with millions of model blocks and thousands of data points.  The machine heats up and, with the fan running, sounds like a DC-9 warming up before flight.

All that said, running everything roughly in parallel is more efficient time-wise than running it sequentially.  An hour of chugging is better than four.  The way I've been doing this is with the Python (2.7) subprocess module's Popen class, running the interpolations for my five values in parallel.  Our Python programmer Lori originally wrote this to run in sequence for a different set of problems.  I bastardized it for my own.



The subprocess part of the code is relatively straightforward.  Function startprocess() in my code covers that.
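
The start-up step can be sketched in isolation.  This is a minimal sketch, not the script itself: the child process here is a one-line Python program standing in for the interactive executable, and the names CHILD, start and run_all are made up for illustration.  It shows the two things that matter: every process is started before any of them is waited on, and the answer to the program's prompt goes in through stdin with a trailing newline.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical stand-in for the interactive executable: a tiny child
# that reads one line from stdin and echoes it back on stdout.
CHILD = 'import sys; line = sys.stdin.readline(); print("got " + line.strip())'


def start(answer, outpath):
    """Start one child, send it one line on stdin, return (process, file)."""
    out = open(outpath, 'w')
    proc = subprocess.Popen([sys.executable, '-c', CHILD],
                            stdin=subprocess.PIPE, stdout=out)
    # The newline stands in for the Enter key.
    proc.stdin.write((answer + '\n').encode('ascii'))
    proc.stdin.flush()
    return proc, out


def run_all(answers):
    """Start every child first, then wait on each one in turn."""
    workdir = tempfile.mkdtemp()
    started = []
    for name in answers:
        outpath = os.path.join(workdir, name + 'output.txt')
        started.append((outpath,) + start(name, outpath))
    results = []
    for outpath, proc, out in started:
        proc.stdin.close()
        proc.wait()
        out.close()
        with open(outpath) as f:
            results.append(f.read().strip())
    return results
```

Sending stdout to a file per process keeps the output of the parallel runs from getting jumbled together on the screen.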

What makes this problem a little more challenging:



1) it's a vendor supplied executable we're dealing with . . . without an API or source . . . that's interactive (you can't feed it the config file path; it asks for it).  This results in a number of time.sleep() and <process>.stdin.write() calls that can be brittle.



2) getting the processes started, as I just mentioned, is easy.  Finding out when to stop, or kill them, requires knowledge of the app and how it generates output.  I've gone for an ugly, but effective check of report file contents.



3) while waiting for the processes to finish their work, I need to know things are working and what's going on.  I've accomplished this by reporting the data files' sizes in MB.



4) the executable isn't designed for a centralized code base (typically all scripts are kept in a folder for the specific project or pit), so it only accepts file paths of up to about 100 characters.  I've omitted the handling for this from my sanitized version of the code, but it made things even messier than they are below.  Also, I don't know if all Windows programs do this, but the paths need to be inside quotes - the path kept breaking on the colon (:) when not quoted.
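
The path handling from item 4 could be wrapped in a small helper along these lines.  This is only a sketch: the name quotedpath is hypothetical, the limit of 100 is my rough figure rather than a documented one, and I am assuming the limit applies to the path without its quotes.

```python
# Assumed limit - "about 100" characters, not a documented value.
MAXPATHCHARS = 100


def quotedpath(path, maxchars=MAXPATHCHARS):
    """
    Return path wrapped in double quotes, ready to
    write to the program's stdin.

    Raises ValueError if the unquoted path is
    longer than maxchars.
    """
    if len(path) > maxchars:
        msg = 'path is {:d} characters; limit is {:d}'
        raise ValueError(msg.format(len(path), maxchars))
    return '"{:s}"'.format(path)
```

Failing early on a path that is too long beats finding out from a confused executable ten seconds into the run.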



Basically, this is a fairly ugly problem and a script that requires babysitting while it runs.  That's OK; it beats the alternative (running it sequentially while watching each run).  I've tried to adhere to DRY (don't repeat yourself) as much as possible, but I suspect this could be improved upon.
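
The babysitting part boils down to a polling loop.  Here is a minimal sketch under a few assumptions: the names sizemb, reportdone and waitfor are made up for this example, the jobs dictionary is a simplified stand-in for my real bookkeeping, and maxchecks exists only so that the loop can be cut short when trying it out.

```python
import os
import time

# Shifting a byte count right by 20 bits gives whole megabytes.
BITSHIFT = 20


def sizemb(path):
    """Return file size in whole MB, or None if the file is not there yet."""
    if os.path.isfile(path):
        return os.stat(path).st_size >> BITSHIFT
    return None


def reportdone(path, endtext):
    """Return True once the report file exists and contains endtext."""
    if not os.path.isfile(path):
        return False
    with open(path, 'r') as f:
        return endtext in f.read()


def waitfor(jobs, endtext, interval=5, maxchecks=None):
    """
    Poll until every job's report file contains endtext.

    jobs maps a name to a (data file, report file) tuple.

    Returns the set of names that finished.
    """
    finished = set()
    checks = 0
    while len(finished) < len(jobs):
        if maxchecks is not None and checks >= maxchecks:
            break
        for name, (datfile, rptfile) in jobs.items():
            if name in finished:
                continue
            # Progress report - size is None until the file appears.
            print('{:s}: data file size (MB) {}'.format(name, sizemb(datfile)))
            if reportdone(rptfile, endtext):
                finished.add(name)
        checks += 1
        if len(finished) < len(jobs):
            time.sleep(interval)
    return finished
```

Reading the whole report file on every pass is not efficient, but report files are small next to the data files, and the check is simple to reason about.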



The reason why I blog it is that I suspect there are other people out there who have to do the same sort of thing with their data.  It doesn't have to be a mining problem.  It can be anything that requires intensive computation across voluminous data with an executable not designed with a Python API.



Notes:  



1) I've omitted the file multirunparameters.py that's in an import statement.  It has a bunch of paths and names that are relevant to my project, but not to the reader's programming needs.



2) Python 2.7 is listed at the top of the file as "mpython."  This is the Python that our mine planning vendor ships that ties into their quite capable Python API.  The executable I call with subprocess.Popen() is a Windows executable provided by a consultant independent of the mine planning vendor.  It just makes sense to package this interpolation inside the mine planning vendor's multirun (~ batch file) framework as part of an overall working of the 3D geologic block model.  The script exits as soon as this part of the batch is complete.  I've inserted a 10 second pause at the end just to allow a quick look before it disappears.



#!C:/MineSight/x64/mpython



"""
Interpolate grades with <consultant> program
from text files.
"""



import argparse

import subprocess as subx
import os
import collections as colx

import time
from datetime import datetime as dt



# Lookup file of constants, pit names, assay names, paths, etc.
import multirunparameters as paramsx



parser = argparse.ArgumentParser()
# 4 letter argument like 'kwat'
# Feed in at command line.
parser.add_argument('pit', help='four letter, lower case pit abbreviation (kwat)', type=str)
args = parser.parse_args()
PIT = args.pit



pitdir = paramsx.PATHS[PIT]
pathx = paramsx.BASEPATH.format(pitdir)
controlfilepathx = paramsx.CONTROLFILEPATH.format(pitdir)



timestart = dt.now()
print(timestart)



PROGRAM = 'C:/MSPROJECTS/EOMReconciliation/2014/Multirun/AllPits/consultantprogram.exe'



ENDTEXT = 'END <consultant> REPORT'



# These names are the only real difference between pits.

# Double quote is for subprocess.Popen object's stdin.write method
# - Windows path breaks on colon without quotes.
ASSAY1DRIVER = 'KDriverASSAY1{:s}CBT.csv"'.format(PIT)
ASSAY2DRIVER = 'KDriverASSAY2{:s}CBT.csv"'.format(PIT)
ASSAY3DRIVER = 'KDriverASSAY3_{:s}CBT.csv"'.format(PIT)
ASSAY4DRIVER = 'KDriverASSAY4_{:s}CBT.csv"'.format(PIT)
ASSAY5DRIVER = 'KDriverASSAY5_{:s}CBT.csv"'.format(PIT)



RETCHAR = '\n'



ASSAY1 = 'ASSAY1'
ASSAY2 = 'ASSAY2'
ASSAY3 = 'ASSAY3'
ASSAY4 = 'ASSAY4'
ASSAY5 = 'ASSAY5'



NAME = 'name'
DRFILE = 'driver file'
OUTPUT = 'output'
DATFILE = 'data file'
RPTFILE = 'report file'



# data, report files
ASSAY1K = 'ASSAY1K.csv'
ASSAY1RPT = 'ASSAY1.RPT'

ASSAY2K = 'ASSAY2K.csv'
ASSAY2RPT = 'ASSAY2.RPT'

ASSAY3K = 'ASSAY3K.csv'
ASSAY3RPT = 'ASSAY3.RPT'

ASSAY4K = 'ASSAY4K.csv'
ASSAY4RPT = 'ASSAY4.RPT'

ASSAY5K = 'ASSAY5K.csv'
ASSAY5RPT = 'ASSAY5.RPT'



OUTPUTFMT = '{:s}output.txt'



ASSAYS = {1:{NAME:ASSAY1,
             DRFILE:controlfilepathx + ASSAY1DRIVER,
             OUTPUT:pathx + OUTPUTFMT.format(ASSAY1),
             DATFILE:pathx + ASSAY1K,
             RPTFILE:pathx + ASSAY1RPT},
          2:{NAME:ASSAY2,
             DRFILE:controlfilepathx + ASSAY2DRIVER,
             OUTPUT:pathx + OUTPUTFMT.format(ASSAY2),
             DATFILE:pathx + ASSAY2K,
             RPTFILE:pathx + ASSAY2RPT},
          3:{NAME:ASSAY3,
             DRFILE:controlfilepathx + ASSAY3DRIVER,
             OUTPUT:pathx + OUTPUTFMT.format(ASSAY3),
             DATFILE:pathx + ASSAY3K,
             RPTFILE:pathx + ASSAY3RPT},
          4:{NAME:ASSAY4,
             DRFILE:controlfilepathx + ASSAY4DRIVER,
             OUTPUT:pathx + OUTPUTFMT.format(ASSAY4),
             DATFILE:pathx + ASSAY4K,
             RPTFILE:pathx + ASSAY4RPT},
          5:{NAME:ASSAY5,
             DRFILE:controlfilepathx + ASSAY5DRIVER,
             OUTPUT:pathx + OUTPUTFMT.format(ASSAY5),
             DATFILE:pathx + ASSAY5K,
             RPTFILE:pathx + ASSAY5RPT}}



DELFILE = 'delete file'
INTERP = 'interp'
SLEEP = 'sleep'
MSGDRIVER = 'message driver'
MSGRETCHAR = 'message return character'
FINISHED1 = 'finished one assay'
FINISHEDALL = 'finished all interpolations'
TIMEELAPSED = 'time elapsed'
FILEEXISTS = 'report file exists'
DATSIZE = 'data file size'
DONE = 'number interpolations finished'
DATFILEEXIST = 'data file not yet there'
SIZECHANGE = 'report file changed size'



# for converting to megabyte file size from os.stat()
BITSHIFT = 20

# sleeptime - 5 seconds
SLEEPTIME = 5

FINISHED = 'finished'

RPTFILECHSIZE = """
         
Report file for {:s}
changed size; killing process . . .

"""



MESGS = {DELFILE:'\n\nDeleting {} . . .\n\n',
         INTERP:'\n\nInterpolating {:s} . . .\n\n',
         SLEEP:'\nSleeping 2 seconds . . .\n\n',
         MSGDRIVER:'\n\nWriting driver file name to stdin . . .\n\n',
         MSGRETCHAR:'\n\nWriting retchar to stdin for {:s} . . .\n\n',
         FINISHED1:'\n\nFinished {:s}\n\n',
         FINISHEDALL:'\n\nFinished interpolation.\n\n',
         TIMEELAPSED:'\n\n{:d} elapsed seconds\n\n',
         FILEEXISTS:'\n\nReport file for {:s} exists . . .\n\n',
         DATSIZE:'\n\nData file size for {:s} is now {:d}MB . . .\n\n',
         DONE:'\n\n{:d} out of {:d} assays are finished . . .\n\n',
         DATFILEEXIST:"\n\n{:s} doesn't exist yet . . .\n\n",
         SIZECHANGE:RPTFILECHSIZE}



def cleanslate():
    """
    Delete all output files prior to interpolation
    so that their existence can be tracked.
    """
    for key in ASSAYS:
        files = (ASSAYS[key][DATFILE],
                 ASSAYS[key][RPTFILE],
                 ASSAYS[key][OUTPUT])
        for filex in files:
            print(MESGS[DELFILE].format(filex))
            if os.path.exists(filex) and os.path.isfile(filex):
                os.remove(filex)
    return 0



def startprocess(assay):
    """
    Start <consultant program> run for given interpolation.

    Return subprocess.Popen object,
    file object (output file).
    """
    print(MESGS[INTERP].format(ASSAYS[assay][NAME]))
    # XXX - I hate time.sleep - hack
    # XXX - try to re-route standard output so that
    #       it's not all jumbled together.
    print(MESGS[SLEEP])
    time.sleep(2)
    # output file for stdout
    f = open(ASSAYS[assay][OUTPUT], 'w')
    procx = subx.Popen('{0}'.format(PROGRAM), stdin=subx.PIPE, stdout=f)
    print(MESGS[SLEEP])
    time.sleep(2)
    # XXX - problem, starting up Excel CBT 22JUN2014
    #       Ah - this is what happens when the <software usb licence>
    #            key is not attached :-(
    print(MESGS[MSGDRIVER])
    print('\ndriver file = {:s}\n'.format(ASSAYS[assay][DRFILE]))
    procx.stdin.write(ASSAYS[assay][DRFILE])
    print(MESGS[SLEEP])
    time.sleep(2)
    # XXX - this is so jacked up -
    #       no idea what is happening when
    print(MESGS[MSGRETCHAR].format(ASSAYS[assay][NAME]))
    procx.stdin.write(RETCHAR)
    print(MESGS[SLEEP])
    time.sleep(2)
    print(MESGS[MSGRETCHAR].format(ASSAYS[assay][NAME]))
    procx.stdin.write(RETCHAR)
    print(MESGS[SLEEP])
    time.sleep(2)
    return procx, f



def crosslookup(assay):
    """
    From assay string, get numeric
    key for ASSAYS dictionary.

    Returns integer.
    """
    for key in ASSAYS:
        if assay == ASSAYS[key][NAME]:
            return key
    return 0



def checkprocess(assay, assaydict):
    """
    Check to see if assay
    interpolation is finished.

    assay is the item in question
    (ASSAY1, ASSAY2, etc.).

    assaydict is the operating dictionary
    for the assay in question.

    Returns True if finished.
    """
    # Report file indicates process finished.
    assaykey = crosslookup(assay)
    rptfile = ASSAYS[assaykey][RPTFILE]
    datfile = ASSAYS[assaykey][DATFILE]
    if os.path.exists(datfile) and os.path.isfile(datfile):
        # Report size of file in MB.
        datfilesize = os.stat(datfile).st_size >> BITSHIFT
        print(MESGS[DATSIZE].format(assay, datfilesize))
    else:
        # Doesn't exist yet.
        print(MESGS[DATFILEEXIST].format(datfile))
    if os.path.exists(rptfile) and os.path.isfile(rptfile):
        # XXX - not the most efficient way,
        #       but this checking the file appears
        #       to work best.
        f = open(rptfile, 'r')
        txt = f.read()
        f.close()
        # XXX - hack - gah.
        if txt.find(ENDTEXT) > -1:
            # End-of-report text found - treat
            # the run as finished.
            print(MESGS[SIZECHANGE].format(assay))
            print(MESGS[SLEEP])
            time.sleep(2)
            return True
    return False



PROCX = 'process'
OUTPUTFILE = 'output file'



# Keeps track of files and progress of <consultant program>.
opdict = colx.OrderedDict()



# get rid of preexisting files
cleanslate()


# start all five roughly in parallel
# ASSAYS keys are numbers
for key in ASSAYS:
    # opdict - ordered with assay names as keys
    namex = ASSAYS[key][NAME]
    opdict[namex] = {}
    assaydict = opdict[namex]
    assaydict[PROCX], assaydict[OUTPUTFILE] = startprocess(key)
    # Initialize active status of process.
    assaydict[FINISHED] = False



# For count.
numassays = len(ASSAYS)
# Loop until all finished.
while True:
    # Cycle until done then break.
    # Sleep SLEEPTIME seconds at a time and check between.
    time.sleep(SLEEPTIME)
    # Count.
    i = 0
    for key in opdict:
        assaydict = opdict[key]
        if not assaydict[FINISHED]:
            status = checkprocess(key, assaydict)
            if status:
                # kill process once report file is complete
                opdict[key][PROCX].kill()
                # Close the stdout capture file for this process.
                assaydict[OUTPUTFILE].close()
                assaydict[FINISHED] = True
                i += 1
        else:
            i += 1
    print(MESGS[DONE].format(i, numassays))
    # all done
    if i == numassays:
        break



print('\n\nFinished interpolation.\n\n')
timeend = dt.now()
elapsed = timeend - timestart



print(MESGS[TIMEELAPSED].format(elapsed.seconds))
print('\n\n{:d} elapsed minutes\n\n'.format(elapsed.seconds // 60))



# Allow quick look at screen.
time.sleep(10)