pyright: July 2024

Friday, July 5, 2024

Graphviz - Editing a DAG Hamilton Graph dot File

Last post featured the DAG Hamilton generated graphviz graph shown below. I'll be dressing this up a little and highlighting some functionality. For the toy example here, the script employed is a bit of overkill. For a bigger workflow, it may come in handy.

I'll start with the finished products:

1) A Hamilton logo and a would-be company logo get added (manually; the "Data Inputs Highlighted" subtitle is there for later processing when we highlight functionality).

2) through 4) are done programmatically (code is shown further down). I saw an example on the Hamilton web pages that used aquamarine as the highlight color; I liked that, so I stuck with it.

2) Data source and data source function highlighted.

3) Web scraping functions highlighted.

4) Output nodes highlighted.

A few observations and notes before we look at configuration and code: I've found the charts to be really helpful in presenting my workflow to users and leadership (full disclosure: my boss liked some initial charts I made; my dream of the PowerPoint to solve all scripter<->customer communication challenges is not yet reality, but for the first time in a long time, I have hope.)

In the web scraping highlighted diagram, you can pretty clearly see that the data_with_company node has an input into the commodity_word_counts node. The domain specific rationale from the last blog post is that I don't want to count every "Barrick Gold" company name occurrence as another mention of "Gold" or "gold."

Toy example notwithstanding, in real life, being able to show where something branches critically is a real help. Assumptions about what a script is doing versus what it is actually doing can be costly in terms of time and productivity for all parties. It helps to be able to say and show ideas like, "What it's doing over here doesn't carry over to that other mission critical part you're really concerned with; it's only for purposes of the visualization which lies over here on the diagram" or "This node up here representing <the real life thing> is your sole source of input for this script; it is not looking at <other real world thing> at all."

graphviz and diagrams like this have been around for decades - UML, database schema visualizations, etc. What makes this whole DAG Hamilton thing better for me is how easy and accessible it is. I've seen C++ UML diagrams over the years (all respect to the C++ people - it takes a lot of ability, discipline, and effort); my first thought is often, "Oh wow . . . I'm not sure I have what it takes to do that . . . and I'm not sure I'd want to . . ."

Enough rationalization and qualifying - on to the config and the code!

I added the title and logos manually. The assumption that the graphviz dot file output of DAG Hamilton will always be in the format shown would be premature and probably wrong. It's an implementation detail subject to change and not a feature. That said, I needed some features in my graph outputs and I achieved them this one time.

Towards the top of the dot file is where the title goes:

// Dependency Graph

digraph {

labelloc="t"

label=<<b>Toy Web Scraping Script Run Diagram<BR/>Data Inputs Highlighted</b>> fontsize="36" fontname=Helvetica

labelloc="t" puts the text at the top of the graph (t for top; "b" would put it at the bottom).
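I typed the title lines in by hand, but the same edit could be scripted. What follows is only a minimal sketch under that assumption: add_title is a hypothetical helper (not part of DAG Hamilton or graphviz), and it assumes the dot source opens with a plain "digraph {" line like the one shown above.

```python
import re


def add_title(dot_src, title_html, fontsize=36):
    """Insert labelloc/label lines right after the opening 'digraph {'.

    Hypothetical helper for illustration only; it assumes the source
    contains a 'digraph {' line followed by a newline.
    """
    header = (
        'labelloc="t"\n'
        f'label=<{title_html}> fontsize="{fontsize}" fontname=Helvetica\n'
    )
    # A function is used as the replacement so nothing in the header
    # is interpreted as a regex escape; only the first match is edited.
    return re.sub(r'(digraph\s*\{\n)',
                  lambda m: m.group(1) + header,
                  dot_src,
                  count=1)


sample = 'digraph {\n\ta -> b\n}\n'
titled = add_title(sample, '<b>Toy Web Scraping Script Run Diagram</b>')
print(titled)
```

The rest of the graph is left untouched, so the result can still be rendered the usual way.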

// Dependency Graph

digraph {

labelloc="t"

label=<<b>Toy Web Scraping Script Run Diagram<BR/>Data Inputs Highlighted</b>> fontsize="36" fontname=Helvetica

hamiltonlogo [label="" image="hamiltonlogolarge.png" shape="box", width=0.6, height=0.6, fixedsize=true]

companylogo [label="" image="fauxcompanylogo.png" shape="box", width=5.10 height=0.6 fixedsize=true]

The DAG Hamilton logo listed first appears to end up in the upper left part of the diagram most of the time (this is an empirical observation on my part; I don't have a super great handle on the internals of graphviz yet).

Getting the company logo next to it requires a bit more effort. A StackOverflow exchange had a suggestion of connecting it invisibly to an initial node. In this case, that would be the data source. Inputs in DAG Hamilton don't get listed in the graphviz dot file by their names, but rather by the node or nodes they are connected to: _parsed_data_inputs instead of "datafile" like you might expect. I have a preference for listing my input nodes only once (deduplicate_inputs=True is the keyword argument to DAG Hamilton's driver object's display_all_functions method that makes the graph).

The change is about one third of the way down the dot file where the node connection edges start getting listed:

parsed_data -> data_with_wikipedia

_parsed_data_inputs [label=<<table border="0"><tr><td>datafile</td><td>str</td></tr></table>> fontname=Helvetica margin=0.15 shape=rectangle style="filled,dashed" fillcolor="#ffffff"]

companylogo -> _parsed_data_inputs [style=invis]

DAG Hamilton has a dashed box for script inputs. That's why there is all that extra description inside the square brackets for that node. I manually added the fillcolor="#ffffff" at the end. It's not necessary for the chart (I believe the default fill of white, #ffffff, was specified near the top of the file), but it is necessary for the code I wrote, which replaces an existing color with something else. Otherwise, it does not affect the output.
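To make the reason for that explicit fillcolor concrete, here is a small sketch of a color swap that only works when the node line already carries a six digit hex fill. It is not the script from this post; recolor_node is a hypothetical name, and the sample dot text is a trimmed-down stand-in for the real file.

```python
import re

AQUAMARINE = '7fffd4'


def recolor_node(dot_src, node, color=AQUAMARINE):
    """Swap the hex fill color on one node's definition line.

    Hypothetical helper: it assumes the node definition sits on a single
    line and already contains fillcolor="#rrggbb". If no fillcolor is
    present, nothing matches and the source comes back unchanged.
    """
    pat = re.compile(
        r'(^\s*' + re.escape(node) + r'\s*\[[^\n]*?fillcolor="#)[0-9a-fA-F]{6}',
        re.MULTILINE)
    return pat.sub(lambda m: m.group(1) + color, dot_src, count=1)


sample = (
    'digraph {\n'
    '\tparsed_data -> data_with_wikipedia\n'
    '\t_parsed_data_inputs [label="datafile" style="filled,dashed" fillcolor="#ffffff"]\n'
    '}\n'
)
recolored = recolor_node(sample, '_parsed_data_inputs')
print(recolored)
```

Without the manually added fillcolor on the input node there would be no existing color to replace, which is why the dot file needed that bit of prep.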

I think that's it for manual prep.

On to the code. Both DAG Hamilton and graphviz have APIs for customizing the graphviz dot file output. I've opted to approach this with brute force text processing. For my needs, this is the best option. YMMV. In general, text processing any code or configuration tends to be brittle. It worked this time.
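One detail of the text processing is worth a quick illustration before the listing: the script matches node names as whole words with \b rather than with a plain substring search. The sketch below uses a made-up two line dot fragment (an assumption, not the real file) to show why that matters when one node name is contained in another.

```python
import re

# Made-up fragment: the input node's name contains the function node's name.
src = ('_parsed_data_inputs [shape=rectangle]\n'
       'parsed_data [shape=box]\n')

# Plain substring search lands inside '_parsed_data_inputs'.
plain_idx = src.find('parsed_data')

# The underscore counts as a word character, so \b does not match between
# '_' and 'p'; the whole-word search skips ahead to the second line.
match = re.search(r'\bparsed_data\b', src)
word_idx = match.start() if match else -1

print('plain find:', plain_idx)
print('whole word:', word_idx)
```

The check for a missing match is there because re.search returns None when the pattern is not found, which is one of the ways this kind of processing can break if the dot file format changes.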

# python 3.12

"""
Try to edit properties of graphviz output.
"""

import sys
import re
import itertools

import graphviz

INPUT = 'ts_with_logos_and_colors'

FILLCOLORSTRLEN = 12

AQUAMARINE = '7fffd4'
COLORLEN = len(AQUAMARINE)

BOLDED = ' penwidth=5'
BOLDEDEDGE = ' [penwidth=5]'

NODESTOCOLOR = {'data_source':['_parsed_data_inputs',
                               'parsed_data'],
                'webscraping':['data_with_wikipedia',
                               'colloquial_company_word_counts',
                               'data_with_company',
                               'commodity_word_counts'],
                'output':['info_output',
                          'info_dict_merged',
                          'wikipedia_report']}

EDGEPAT = r'\b{0:s}\b[ ][-][>][ ]\b{1:s}\b'

TITLEPAT = r'Toy Web Scraping Script Run Diagram[<]BR[/][>]'
ENDTITLEPAT = r'</b>>'

# Two tuples as values for edges.
EDGENODESTOBOLD = {'data_source':[('_parsed_data_inputs', 'parsed_data')],
                   'webscraping':[('data_with_wikipedia', 'colloquial_company_word_counts'),
                                  ('data_with_wikipedia', 'data_with_company'),
                                  ('data_with_wikipedia', 'commodity_word_counts'),
                                  ('data_with_company', 'commodity_word_counts')],
                   'output':[('data_with_company', 'info_output'),
                             ('colloquial_company_word_counts', 'info_dict_merged'),
                             ('commodity_word_counts', 'info_dict_merged'),
                             ('info_dict_merged', 'wikipedia_report'),
                             ('data_with_company', 'info_dict_merged')]}

OUTPUTFILES = {'data_source':'data_source_highlighted',
               'webscraping':'web_scraping_functions_highlighted',
               'output':'output_functions_highlighted'}

TITLES = {'data_source':'Data Sources and Data Source Functions Highlighted',
          'webscraping':'Web Scraping Functions Highlighted',
          'output':'Output Functions Highlighted'}

def get_new_source_nodecolor(src, nodex):
    """
    Return new source string for graphviz
    with selected node colored aquamarine.

    src is the original graphviz text source
    from file.

    nodex is the node to have its color edited.
    """
    # Full word, exact match.
    wordmatchpat = r'\b' + nodex + r'\b'
    pat = re.compile(wordmatchpat)
    # Empty string to hold full output of edited source.
    src2 = ''
    match = re.search(pat, src)
    # nodeidx = src.find(nodex)
    nodeidx = match.span()[0]
    print('nodeidx = ', nodeidx)
    src2 += src[:nodeidx]
    idxcolor = src[nodeidx:].find('fillcolor')
    print('idxcolor = ', idxcolor)
    # fillcolor="#b4d8e4"
    # 012345678901234567
    src2 += src[nodeidx:nodeidx + idxcolor + FILLCOLORSTRLEN]
    src2 += AQUAMARINE
    currentposit = nodeidx + idxcolor + FILLCOLORSTRLEN + COLORLEN
    src2 += src[currentposit:]
    return src2

def get_new_title(src, title):
    """
    Return new source string for graphviz
    with new title part of header.

    src is the original graphviz text source
    from file.

    title is a string.
    """
    # Empty string to hold full output of edited source.
    src2 = ''
    match = re.search(TITLEPAT, src)
    titleidx = match.span()[1]
    print('titleidx = ', titleidx)
    src2 += src[:titleidx]
    idxendtitle = src[titleidx:].find(ENDTITLEPAT)
    print('idxendtitle = ', idxendtitle)
    src2 += title
    currentposit = titleidx + idxendtitle
    print('currentposit = ', currentposit)
    src2 += src[currentposit:]
    return src2

def get_new_source_penwidth_nodes(src, nodex):
    """
    Return new source string for graphviz
    with selected node having bolded border.

    src is the original graphviz text source
    from file.

    nodex is the node to have its box bolded.
    """
    # Full word, exact match.
    wordmatchpat = r'\b' + nodex + r'\b'
    pat = re.compile(wordmatchpat)
    # Empty string to hold full output of edited source.
    src2 = ''
    match = re.search(pat, src)
    nodeidx = match.span()[0]
    print('nodeidx = ', nodeidx)
    src2 += src[:nodeidx]
    idxbracket = src[nodeidx:].find(']')
    src2 += src[nodeidx:nodeidx + idxbracket]
    print('idxbracket = ', idxbracket)
    src2 += BOLDED
    src2 += src[nodeidx + idxbracket:]
    return src2

def get_new_source_penwidth_edges(src, nodepair):
    """
    Return new source string for graphviz
    with selected node pair having bolded edge.

    src is the original graphviz text source
    from file.

    nodepair is the two node tuple to have
    its edge bolded.
    """
    # Full word, exact match.
    edgepat = EDGEPAT.format(*nodepair)
    print(edgepat)
    pat = re.compile(edgepat)
    # Empty string to hold full output of edited source.
    src2 = ''
    match = re.search(pat, src)
    edgeidx = match.span()[1]
    print('edgeidx = ', edgeidx)
    src2 += src[:edgeidx]
    src2 += BOLDEDEDGE
    src2 += src[edgeidx:]
    return src2

def makehighlightedfuncgraphs():
    """
    Cycle through functionalities to make specific
    highlighted functional parts of the workflow
    output graphs.

    Returns dictionary of new filenames.
    """
    with open(INPUT, 'r') as f:
        src = f.read()
    retval = {}
    for functionality in TITLES:
        print(functionality)
        src2 = src
        retval[functionality] = {'dot':None,
                                 'svg':None,
                                 'png':None}
        src2 = get_new_title(src, TITLES[functionality])
        # list of nodes.
        to_process = (nodex for nodex in NODESTOCOLOR[functionality])
        countergenerator = itertools.count()