Friday, July 5, 2024

Graphviz - Editing a DAG Hamilton Graph dot File

Last post featured the DAG Hamilton generated graphviz graph shown below. I'll be dressing this up a little and highlighting some functionality. For the toy example here, the script employed is a bit of overkill. For a bigger workflow, it may come in handy.





I'll start with the finished products:

1) A Hamilton logo and a would be company logo get added (manual; the Data Inputs Highlighted subtitle is there for later processing when we highlight functionality.)

2) through 4) are done programmatically (code is shown further down). I saw an example on the Hamilton web pages that used aquamarine as the highlight color; I liked that, so I stuck with it.

2) Data source and data source function highlighted.




3) Web scraping functions highlighted.



4) Output nodes highlighted.



A few observations and notes before we look at configuration and code: I've found the charts to be really helpful in presenting my workflow to users and leadership (full disclosure: my boss liked some initial charts I made; my dream of the PowerPoint to solve all scripter<->customer communication challenges is not yet reality, but for the first time in a long time, I have hope.)

In the web scraping highlighted diagram, you can pretty clearly see that data_with_company node has an input into the commodity_word_counts node. The domain specific rationale from the last blog post is that I don't want to count every "Barrick Gold" company name occurrence as another mention of "Gold" or "gold."

Toy example notwithstanding, in real life, being able to show where something branches critically is a real help. Assumptions about what a script is actually doing versus what it is doing can actually be costly in terms of time and productivity for all parties. Being able to say and show ideas like, "What it's doing over here doesn't carry over to that other mission critical part you're really concerned with; it's only for purposes of the visualization which lies over here on the diagram" or "This node up here representing <the real life thing> is your sole source of input for this script; it is not looking at <other real world thing> at all."

graphviz and diagrams like this have been around for decades - UML, database schema visualizations, etc. What makes this whole DAG Hamilton thing better for me is how easy and accessible it is. I've seen C++ UML diagrams over the years (all respect to the C++ people - it takes a lot of ability, discipline, and effort); my first thought is often, "Oh wow . . . I'm not sure I have what it takes to do that . . . and I'm not sure I'd want to . . ."

Enough rationalization and qualifying - on to the config and the code!

I added the title and logos manually. The assumption that the graphviz dot file output of DAG Hamilton will always be in the format shown would be premature and probably wrong. It's an implementation detail subject to change and not a feature. That said, I needed some features in my graph outputs and I achieved them this one time.

Towards the top of the dot file is where the title goes:

// Dependency Graph
digraph {
        labelloc="t"
        label=<<b>Toy Web Scraping Script Run Diagram<BR/>Data Inputs Highlighted</b>> fontsize="36" fontname=Helvetica

labelalloc="t" puts the text at the top of the graph (t for top, I think).

// Dependency Graph
digraph {
        labelloc="t"
        label=<<b>Toy Web Scraping Script Run Diagram<BR/>Data Inputs Highlighted</b>> fontsize="36" fontname=Helvetica
        hamiltonlogo [label="" image="hamiltonlogolarge.png" shape="box", width=0.6, height=0.6, fixedsize=true]
        companylogo [label="" image="fauxcompanylogo.png" shape="box", width=5.10 height=0.6 fixedsize=true]

The DAG Hamilton logo listed first appears to end up in the upper left part of the diagram most of the time (this is an empirical observation on my part; I don't have a super great handle on the internals of graphviz yet).

Getting the company logo next to it requires a bit more effort. A StackOverflow exchange had a suggestion of connecting it invisibly to an initial node. In this case, that would be the data source. Inputs in DAG Hamilton don't get listed in the graphviz dot file by their names, but rather by the node or nodes they are connected to: _parsed_data_inputs instead of "datafile" like you might expect. I have a preference for listing my input nodes only once (deduplicate_inputs=True is the keyword argument to DAG Hamilton's driver object's display_all_functions method that makes the graph).

The change is about one third of the way down the dot file where the node connection edges start getting listed:


parsed_data -> data_with_wikipedia
_parsed_data_inputs [label=<<table border="0"><tr><td>datafile</td><td>str</td></tr></table>> fontname=Helvetica margin=0.15 shape=rectangle style="filled,dashed" fillcolor="#ffffff"]
        companylogo -> _parsed_data_inputs [style=invis]

DAG Hamilton has a dashed box for script inputs. That's why there is all that extra description inside the square brackets for that node. I manually added the fillcolor="#ffffff" at the end. It's not necessary for the chart (I believe the default fill of white /#ffffff was specified near the top of the file), but it is necessary for the code I wrote to replace the existing color with something else. Otherwise, it does not affect the output.

I think that's it for manual prep.

Onto the code. Both DAG Hamilton and graphviz have API's for customizing the graphviz dot file output. I've opted to approach this with brute force text processing. For my needs, this is the best option. YMMV. In general, text processing any code or configuration tends to be brittle. It worked this time.


# python 3.12

"""
Try to edit properties of graphviz output.
"""

import sys

import re

import itertools

import graphviz

INPUT = 'ts_with_logos_and_colors'

FILLCOLORSTRLEN = 12
AQUAMARINE = '7fffd4'
COLORLEN = len(AQUAMARINE)

BOLDED = ' penwidth=5'
BOLDEDEDGE = ' [penwidth=5]'

NODESTOCOLOR = {'data_source':['_parsed_data_inputs',
                               'parsed_data'],
                'webscraping':['data_with_wikipedia',
                               'colloquial_company_word_counts',
                               'data_with_company',
                               'commodity_word_counts'],
                'output':['info_output',
                          'info_dict_merged',
                          'wikipedia_report']}

EDGEPAT = r'\b{0:s}\b[ ][-][>][ ]\b{1:s}\b'

TITLEPAT = r'Toy Web Scraping Script Run Diagram[<]BR[/][>]'
ENDTITLEPAT = r'</b>>'

# Two tuples as values for edges.
EDGENODESTOBOLD = {'data_source':[('_parsed_data_inputs', 'parsed_data')],
                   'webscraping':[('data_with_wikipedia', 'colloquial_company_word_counts'),
                                  ('data_with_wikipedia', 'data_with_company'),
                                  ('data_with_wikipedia', 'commodity_word_counts'),
                                  ('data_with_company', 'commodity_word_counts')],
                   'output':[('data_with_company', 'info_output'),
                             ('colloquial_company_word_counts', 'info_dict_merged'),
                             ('commodity_word_counts', 'info_dict_merged'),
                             ('info_dict_merged', 'wikipedia_report'),
                             ('data_with_company', 'info_dict_merged')]}

OUTPUTFILES = {'data_source':'data_source_highlighted',
               'webscraping':'web_scraping_functions_highlighted',
               'output':'output_functions_highlighted'}

TITLES = {'data_source':'Data Sources and Data Source Functions Highlighted',
          'webscraping':'Web Scraping Functions Highlighted',
          'output':'Output Functions Highlighted'}

def get_new_source_nodecolor(src, nodex):
    """
    Return new source string for graphviz
    with selected node colored aquamarine.

    src is the original graphviz text source
    from file.

    nodex is the node to have it's color edited.
    """
    # Full word, exact match.
    wordmatchpat = r'\b' + nodex + r'\b'
    pat = re.compile(wordmatchpat)
    # Empty string to hold full output of edited source.
    src2 = ''
    match = re.search(pat, src)
    # nodeidx = src.find(nodex)
    nodeidx = match.span()[0]
    print('nodeidx = ', nodeidx)
    src2 += src[:nodeidx]
    idxcolor = src[nodeidx:].find('fillcolor')
    print('idxcolor = ', idxcolor)
    # fillcolor="#b4d8e4"
    # 012345678901234567
    src2 += src[nodeidx:nodeidx + idxcolor + FILLCOLORSTRLEN]
    src2 += AQUAMARINE
    currentposit = nodeidx + idxcolor + FILLCOLORSTRLEN + COLORLEN
    src2 += src[currentposit:]
    return src2

def get_new_title(src, title):
    """
    Return new source string for graphviz
    with new title part of header.

    src is the original graphviz text source
    from file.

    title is a string.
    """
    # Empty string to hold full output of edited source.
    src2 = ''
    match = re.search(TITLEPAT, src)
    titleidx = match.span()[1]
    print('titleidx = ', titleidx)
    src2 += src[:titleidx]
    idxendtitle = src[titleidx:].find(ENDTITLEPAT)
    print('idxendtitle = ', idxendtitle)
    src2 += title
    currentposit = titleidx + idxendtitle
    print('currentposit = ', currentposit)
    src2 += src[currentposit:]
    return src2

def get_new_source_penwidth_nodes(src, nodex):
    """
    Return new source string for graphviz
    with selected node having bolded border.

    src is the original graphviz text source
    from file.

    nodex is the node to have its box bolded.
    """
    # Full word, exact match.
    wordmatchpat = r'\b' + nodex + r'\b'
    pat = re.compile(wordmatchpat)
    # Empty string to hold full output of edited source.
    src2 = ''
    match = re.search(pat, src)
    nodeidx = match.span()[0]
    print('nodeidx = ', nodeidx)
    src2 += src[:nodeidx]
    idxbracket = src[nodeidx:].find(']')
    src2 += src[nodeidx:nodeidx + idxbracket]
    print('idxbracket = ', idxbracket)
    src2 += BOLDED
    src2 += src[nodeidx + idxbracket:]
    return src2

def get_new_source_penwidth_edges(src, nodepair):
    """
    Return new source string for graphviz
    with selected node pair having bolded edge.

    src is the original graphviz text source
    from file.

    nodepair is the two node tuple to have
    its edge bolded.
    """
    # Full word, exact match.
    edgepat = EDGEPAT.format(*nodepair)
    print(edgepat)
    pat = re.compile(edgepat)
    # Empty string to hold full output of edited source.
    src2 = ''
    match = re.search(pat, src)
    edgeidx = match.span()[1]
    print('edgeidx = ', edgeidx)
    src2 += src[:edgeidx]
    src2 += BOLDEDEDGE 
    src2 += src[edgeidx:]
    return src2

def makehighlightedfuncgraphs():
    """
    Cycle through functionalities to make specific
    highlighted functional parts of the workflow
    output graphs.

    Returns dictionary of new filenames.
    """
    with open(INPUT, 'r') as f:
        src = f.read()

    retval = {}
    
    for functionality in TITLES:
        print(functionality)
        src2 = src
        retval[functionality] = {'dot':None,
                                 'svg':None,
                                 'png':None}
        src2 = get_new_title(src, TITLES[functionality])
        # list of nodes.
        to_process = (nodex for nodex in NODESTOCOLOR[functionality])
        countergenerator = itertools.count()
        count = next(countergenerator)
        print('\nDoing node colors\n')
        for nodex in to_process:
            print(nodex)
            src2 = get_new_source_nodecolor(src2, nodex)
            count = next(countergenerator)
        to_process = (nodex for nodex in NODESTOCOLOR[functionality])
        countergenerator = itertools.count()
        count = next(countergenerator)
        print('\nDoing node bolding\n')
        for nodex in to_process:
            print(nodex)
            src2 = get_new_source_penwidth_nodes(src2, nodex)
            count = next(countergenerator)
        print('Bolding edges . . .')
        to_process = (nodex for nodex in EDGENODESTOBOLD[functionality])
        countergenerator = itertools.count()
        count = next(countergenerator)
        for nodepair in to_process:
            print(nodepair)
            src2 = get_new_source_penwidth_edges(src2, nodepair)
            count = next(countergenerator)
        print('Writing output files . . .')
        outputfile = OUTPUTFILES[functionality]
        with open(outputfile, 'w') as f:
            f.write(src2)
        graphviz.render('dot', 'png', outputfile)
        graphviz.render('dot', 'svg', outputfile)

makehighlightedfuncgraphs()

Thanks for stopping by.

Thursday, July 4, 2024

DAG Hamilton Workflow for Toy Text Processing Script

Hello. It's been a minute.

I was fortunate to attend PYCON US in Pittsburgh earlier this year. DAGWorks had a booth on the expo floor where I discovered Hamilton. The project grabbed my attention as something that could help organize and present my code workflow better. My reaction could be compared to browsing Walmart while picking up a hardware item and seeing the perfect storage medium for your clothes or crafts at a bargain price, but even better, having someone there to explain the whole thing to you. The folks at the booth were really helpful.




Below I take on a contrived web scraping (it's crude) script in my domain (metals mining) and create a Hamilton workflow from it.

Pictured below is the Hamilton flow in the graphviz output format the project uses for flowcharts (graphviz has been around for decades - an oldie but goodie as it were).






I start with a csv file that has some really basic data on three big American metal mines (I did have to research the Wikipedia addresses - for instance, I originally looked for the Goldstrike Mine under the name "Post-Betze." It goes by several different names and encompasses several mines - more on that anon):

mine,state,commodity,wikipedia page,colloquial association
Red Dog,Alaska,zinc,https://en.wikipedia.org/wiki/Red_Dog_mine,Teck
Goldstrike,Nevada,gold,https://en.wikipedia.org/wiki/Goldstrike_mine,Nevada Gold Mines
Bingham Canyon,Utah,copper,https://en.wikipedia.org/wiki/Bingham_Canyon_Mine,Kennecott

Basically, I am going to attempt to scrape Wikipedia for information on who owns the three mines. Then I will try to use heuristics to gather information on what I think I know about them and gauge how up to date the Wikipedia information is.

Hamilton uses a system whereby you name your functions in a noun-like fashion ("def stuff()" instead of "def getstuff()") and feed those names as variables to the other functions in the workflow as parameters. This is what allows the tool to check your workflow for inconsistencies (types, for instance) and build the graphviz chart shown above.

You can use separate modules with functions and import them. I've done some of this on the bigger workflows I work with. Your Hamilton functions then end up being little one liners that call the bigger functions in the modules. This is necessary if you have functions you use repeatedly in your workflow that take different values at different stages. For this toy project, I've kept the whole thing self contained in one module toyscriptiii.py (yes, the iii in the filename represents my multiple failed attempts at web scraping and text processing - it's harder than it looks).

Below is the Hamilton main file run.py (I believe the "run.py" name is convention.) I have done my best to preserve the dictionary return values as "faux immutable" through use of the copy module in each function. This helps me in debugging and examining output, much of which can be done from the run.py file (all the return values are stored in a dictionary). I've worked with a dataset with about 600,000 rows that had about 10 nodes. My computer has 32GB of RAM (Windows 11); it handled memory fine (less than half). For really big data, keeping all these dictionaries in memory might be a problem.

# python 3.12

"""
Hamilton demo.
"""

import sys

import pprint

from hamilton import driver

import toyscriptiii as ts

dr = driver.Builder().with_modules(ts).build()

dr.display_all_functions("ts.png", deduplicate_inputs=True, keep_dot=True, orient='BR')

results = dr.execute(['parsed_data',
                      'data_with_wikipedia',
                      'data_with_company',
                      'info_output',
                      'commodity_word_counts',
                      'colloquial_company_word_counts',
                      'info_dict_merged',
                      'wikipedia_report'],
                      inputs={'datafile':'data.csv'})

pprint.pprint(results['info_dict_merged'])
print(results['info_output'])
print(results['wikipedia_report'])

The main toy module with functions configured for the Hamilton graph:

# python 3.12

"""
Toy script.

Takes some input from a csv file on big American
mines and looks at Wikipedia text for some extra
context.
"""

import copy

import pprint

import sys

from urllib import request

import re

from bs4 import BeautifulSoup

def parsed_data(datafile:str) -> dict:
    """
    Get csv data into a dictionary keyed on mine name.
    """
    retval = {}
    with open(datafile, 'r') as f:
        headers = [x.strip() for x in next(f).split(',')]
        for linex in f:
            vals = [x.strip() for x in linex.split(',')]
            retval[vals[0]] = {key:val for key, val in zip(headers, vals)} 
    pprint.pprint(retval)
    return retval
        
def data_with_wikipedia(parsed_data:dict) -> dict:
    """
    Connect to wikipedia sites and fill in
    raw html data.

    Return dictionary.
    """
    retval = copy.deepcopy(parsed_data)
    for minex in retval:
        obj = request.urlopen(retval[minex]['wikipedia page'])
        html = obj.read()
        soup = BeautifulSoup(html, 'html.parser')
        print(soup.title)
        # Text from html and strip out newlines.
        newstring = soup.get_text().replace('\n', '')
        retval[minex]['wikipediatext'] = newstring
    return retval

def data_with_company(data_with_wikipedia:dict) -> dict:
    """
    Fetches company ownership for mine out of 
    Wikipedia text dump.

    Returns a new dictionary with the company name
    without the big wikipedia text dump.
    """
    # Wikipedia setup for mine company name.
    COMPANYPAT = r'[a-z]Company'
    # Lower case followed by upper case heuristic.
    ENDCOMPANYPAT = '[a-z][A-Z]'
    retval = copy.deepcopy(data_with_wikipedia)
    companypat = re.compile(COMPANYPAT)
    endcompanypat = re.compile(ENDCOMPANYPAT) 
    for minex in retval:
        print(minex)
        match = re.search(companypat, retval[minex]['wikipediatext'])
        if match:
            print('Company match span = ', match.span())
            companyidx = match.span()[1]
            match2 = re.search(endcompanypat, retval[minex]['wikipediatext'][companyidx:])
            print('End Company match span = ', match2.span())
            retval[minex]['company'] = retval[minex]['wikipediatext'][companyidx:companyidx + match2.span()[0] + 1]
        # Get rid of big text dump in return value.
        retval[minex].pop('wikipediatext')
    return retval

def info_output(data_with_company:dict) -> str:
    """
    Prints some output text to a file for each
    mine in the data_with_company dictionary.

    Returns string filename of output.
    """
    INFOLINEFMT = 'The {mine:s} mine is a big {commodity:s} mine in the State of {state:s} in the US.'
    COMPANYLINEFMT = '\n    {company:s} owns the mine.\n\n'
    retval = 'mine_info.txt'
    with open(retval, 'w') as f:
        for minex in data_with_company:
            print(INFOLINEFMT.format(**data_with_company[minex]), file=f)
            print(COMPANYLINEFMT.format(**data_with_company[minex]), file=f)
    return retval

def commodity_word_counts(data_with_wikipedia:dict, data_with_company:dict) -> dict:
    """
    Return dictionary keyed on mine with counts of
    commodity (e.g., zinc etc.) mentions on Wikipedia
    page (excluding ones in the company name).
    """
    retval = {}
    # This will probably miss some occurrences at mashed together
    # word boundaries. It is a rough estimate.
    # '\b[Gg]old\b'
    commoditypatfmt = r'\b[{0:s}{1:s}]{2:s}\b'
    for minex in data_with_wikipedia:
        print(minex)
        commodityuc = data_with_wikipedia[minex]['commodity'][0].upper()
        commoditypat = commoditypatfmt.format(commodityuc,
                                              data_with_wikipedia[minex]['commodity'][0],
                                              data_with_wikipedia[minex]['commodity'][1:])
        print(commoditypat)
        commoditymatches = re.findall(commoditypat, data_with_wikipedia[minex]['wikipediatext'])
        # pprint.pprint(commoditymatches)
        nummatchesraw = len(commoditymatches)
        print('Initial length of commoditymatches is {0:d}.'.format(nummatchesraw))
        companymatches = re.findall(data_with_company[minex]['company'],
                                    data_with_wikipedia[minex]['wikipediatext'])
        numcompanymatches = len(companymatches)
        print('Length of companymatches is {0:d}.'.format(numcompanymatches))
        # Is the commodity name part of the company name?
        print('commoditypat = ', commoditypat)
        print(data_with_company[minex]['company'])
        commoditymatchcompany = re.search(commoditypat, data_with_company[minex]['company'])
        if commoditymatchcompany:
            print('commoditymatchcompany.span() = ', commoditymatchcompany.span())
            nummatchesfinal = nummatchesraw - numcompanymatches
            retval[minex] = nummatchesfinal 
        else:
            retval[minex] = nummatchesraw 
    return retval

def colloquial_company_word_counts(data_with_wikipedia:dict) -> dict:
    """
    Find the number of times the company you associate with
    the property/mine (very subjective) is within the
    text of the mine's wikipedia article.
    """
    retval = {}
    for minex in data_with_wikipedia:
        colloquial_pat = data_with_wikipedia[minex]['colloquial association']
        print(minex)
        nummatches = len(re.findall(colloquial_pat, data_with_wikipedia[minex]['wikipediatext']))
        print('{0:d} matches for colloquial association {1:s}.'.format(nummatches, colloquial_pat))
        retval[minex] = nummatches
    return retval

def info_dict_merged(data_with_company:dict,
                     commodity_word_counts:dict,
                     colloquial_company_word_counts:dict) -> dict:
    """
    Get a dictionary with all the collected information
    in it minus the big Wikipedia text dump.
    """
    retval = copy.deepcopy(data_with_company)
    for minex in retval:
        retval[minex]['colloquial association count'] = colloquial_company_word_counts[minex]
        retval[minex]['commodity word count'] = commodity_word_counts[minex]
    return retval

def wikipedia_report(info_dict_merged:dict) -> str:
    """
    Writes out Wikipedia information (word counts)
    to file in prose; returns string filename.
    """
    retval = 'wikipedia_info.txt'
    colloqfmt = 'The {0:s} mine has {1:d} occurrences of colloquial association {2:s} in its Wikipedia article text.\n'
    commodfmt = 'The {0:s} mine has {1:d} occurrences of commodity name {2:s} in its Wikipedia article text.\n\n'
    with open(retval, 'w') as f:
        for minex in info_dict_merged:
            print(colloqfmt.format(info_dict_merged[minex]['mine'],
                                   info_dict_merged[minex]['colloquial association count'],
                                   info_dict_merged[minex]['colloquial association']), file=f)
            print(commodfmt.format(info_dict_merged[minex]['mine'],
                                   info_dict_merged[minex]['commodity word count'],
                                   info_dict_merged[minex]['commodity']), file=f)
    return retval

My REGEX abilities are somewhere between "I've heard the term REGEX and know regular expressions exist" and bracketed characters in each slot brute force. It worked for this toy example. Each Wikipedia page features the word "Company" followed by the name of the owning corporate entity.

Here is are the two text outputs the script produces from the information provided (Wikipedia articles from July, 2024):

The Red Dog mine is a big zinc mine in the State of Alaska in the US.

    NANA Regional Corporation owns the mine.


The Goldstrike mine is a big gold mine in the State of Nevada in the US.

    Barrick Gold owns the mine.


The Bingham Canyon mine is a big copper mine in the State of Utah in the US.

    Rio Tinto Group owns the mine.

The Red Dog mine has 21 occurrences of colloquial association Teck in its Wikipedia article text.

The Red Dog mine has 29 occurrences of commodity name zinc in its Wikipedia article text.


The Goldstrike mine has 0 occurrences of colloquial association Nevada Gold Mines in its Wikipedia article text.

The Goldstrike mine has 16 occurrences of commodity name gold in its Wikipedia article text.


The Bingham Canyon mine has 49 occurrences of colloquial association Kennecott in its Wikipedia article text.

The Bingham Canyon mine has 84 occurrences of commodity name copper in its Wikipedia article text.

Company names are relatively straightforward, although mining company and properties acquisitions and mergers being what they are, it can get complicated. I unwittingly chose three properties that Wikipedia reports as having one owner. Other big mines like Morenci, Arizona (copper) and Cortez, Nevada (gold) show more than one owner; that case is for another programming day. The Goldstrike information might be out of date - no mention of Nevada Gold Mines or Newmont (one mention, but in a different context). The Cortez Wikipedia page is more current, although it still doesn't mention Nevada Gold Mines.

The inclusion of colloquial association in the input csv file was an afterthought based on a lot of the Wikipedia information not being completely in line with what I thought I knew. Teck is the operator of the Red Dog Mine in Alaska. That name does get mentioned frequently in the Wikipedia article.

Enough mining stuff - it is a programming blog after all. Next time (not written yet) I hope to cover dressing up and highlighting the graphviz output a bit.

Thank you for stopping by.