Thursday, August 11, 2016

sqlcmd faux csv dump and parsing with the csv module

Lately I had another Excel-VBA-Python one-off hack project.  Once again I faced the dilemma of not being able to use MSSQL's bcp because my query string was too long.  sqlcmd can run a query from a big SQL file, but, to the best of my knowledge, it does not do csv dumps.

This is a hack.  I would normally go to hell for it, but I've done so many other bad hacks I'd have to declare bankruptcy on my programming soul and start over.  Onward.

mssql query file:

<SQL code>

< . . . variable declarations, temp table declarations, etc. . . . >


DECLARE @COMMA CHAR(1) = ',';
DECLARE @LOSSLESS INT = 3;
DECLARE @DOUBLEQUOTE CHAR(1) = CHAR(34);

-- Concatenate strings.
-- Need quoted strings for stockpiles with spaces.
SELECT @DOUBLEQUOTE + StockpileShortName + @DOUBLEQUOTE + @COMMA +
       @DOUBLEQUOTE + StockpileID + @DOUBLEQUOTE + @COMMA +
       @DOUBLEQUOTE + StkLoc + @DOUBLEQUOTE + @COMMA +
       -- Go for full float precision.
       CONVERT(VARCHAR(35), tonnes, @LOSSLESS) + @COMMA +
       CONVERT(VARCHAR(35), grade01, @LOSSLESS) + @COMMA +
       CONVERT(VARCHAR(35), grade02, @LOSSLESS) + @COMMA +
       CONVERT(VARCHAR(35), grade03, @LOSSLESS) + @COMMA +
       CONVERT(VARCHAR(35), grade04, @LOSSLESS) + @COMMA +
       CONVERT(VARCHAR(35), grade05, @LOSSLESS) + @COMMA +
       CONVERT(VARCHAR(35), grade06, @LOSSLESS)
FROM ##inputresultspvctrachte


< . . . ORDER BY clause . . .>

<End SQL code>

It's pretty obvious what I'm doing (and I'd be shocked if I'm the first to do it): concatenate all the fields of each result record onto one comma-separated line.

A couple notes:

1) All my string identifiers are in double quotes; all my float values are unquoted text.  This simplifies the Python csv module code below.

2) The @LOSSLESS "constant" - Microsoft's SQL documentation doesn't list a named enumeration for this per se; it's just a straight up whole number 3.  I'm a bit obsessive about constants - wrap that baby in a variable declaration!  Lossless double precision means, if I recall correctly, SQL Server will give you seventeen significant digits - enough to round-trip a double.  This works for what I'm doing (mining stockpile management).
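The seventeen-digit claim is easy to sanity check from Python.  The '.17g' format is roughly what CONVERT style 3 produces; a sketch with a made-up tonnage value:

```python
# Sketch: 17 significant digits are enough to round-trip an IEEE 754 double.
# The tonnage value below is made up for illustration.
tonnes = 28776.512345678901

lossless = format(tonnes, '.17g')   # ~ CONVERT(VARCHAR(35), tonnes, 3)
assert float(lossless) == tonnes    # no precision lost

lossy = format(tonnes, '.6g')       # short display-style precision
assert float(lossy) != tonnes       # precision lost
```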

The (rough) mssql command to run the query from a DOS prompt:

sqlcmd -S MYSERVERNAME -U MYUSERNAME -P MYPASSWORD -i myqueryfile.sql -o theoutputfile.csv -b

(Note the lowercase -i for the input file; capital -I is the quoted-identifier switch.)  The -b switch makes sqlcmd exit with a non-zero Windows error code when the batch fails.  It's a crude check for whether the query parsed OK and ran, but it's better than nothing.
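From Python, that exit code can be checked with the subprocess module.  A sketch - the real sqlcmd call is shown in a comment (server, user, and file names are the placeholders from above), and a stand-in command is used so the snippet runs anywhere:

```python
import subprocess
import sys

# The real call would look something like:
# result = subprocess.run(['sqlcmd', '-S', 'MYSERVERNAME', '-U', 'MYUSERNAME',
#                          '-P', 'MYPASSWORD', '-i', 'myqueryfile.sql',
#                          '-o', 'theoutputfile.csv', '-b'])
# Stand-in command (always succeeds) so the sketch is runnable:
result = subprocess.run([sys.executable, '-c', 'pass'])

# -b maps batch failure to a non-zero exit code.
if result.returncode != 0:
    raise RuntimeError('sqlcmd batch failed')
```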

The output looks something like this (sorry about the small font):

<. . . sqlcmd messages . . .>

"KEY003","hakunamatadacopper","good",28776.5,X.XXXXX,X.XXXXX,X.XXXXX,X.XXXXXX,XX.XXXX,X.XXXXX
"KEY005","tembomalachite","not as good",25855.9,X.XXXXX,X.XXXXXX,X.XXXXX,X.XXXXXX,XX.XXXX,X.XXXXX
"KEY006","simbacobalt","not as good",156767,X.XXXXXX,X.XXXXXXX,X.XXXXXX,X.XXXXXXX,XX.XXXX,X.XXXXXX
"KEY010","jambocobalt","good",488977,X.XXXXX,X.XXXXXX,X.XXXX,X.XXXXXX,XXX.XXX,X.XXXXX
"KEY015","cucoagogo","good",39576.7,X.XXXX,X.XXXXXX,X.XXXXX,X.XXXXXX,XX.XXXX,X.XXXXX
"KEY016","greenrock","good",160,X.XXX,X.XXX,X.XXX,X.XXX,XXX.XX,X.XX
"KEY033","pinkrock","not as good",81504.3,X.XXXXX,X.XXXXXX,X.XXXXX,X.XXXXXX,XXX.XXX,X.XXXX
"KEY006","funkyleach","not as good",55866.1,X.XXXXXX,X.XXXXXX,X.XXXXXX,X.XXXXXX,XXX.XXX,X.XXXXXX
"KEY010","metalhome","good",30301.1,X.XXXXX,X.XXXXXX,X.XXXXX,X.XXXXXX,XXX.XX,X.XXXXX
"KEY015","boulderpile","good",2878.25,X.XX,X.XX,X.XXX,X.XXX,XX.XXX,X.XXX
"KEY033","berm","not as good",5309.97,X.XXXXX,X.XXXXXX,X.XXXXX,X.XXXXXX,XXX.XXX,X.XXXXX

(11 rows affected)

I've given my stockpiles funny names and X'ed out the numeric grades to sanitize this, but you get the general idea.

Now, finally, some Python code.  I'll get the lines of the file (faux csv) I want and parse them with the csv module's reader object.  The whole deal is kind of verbose (I have a collections.namedtuple object that takes each "column" as an attribute), so I'm only going to show the part that segregates the lines I want and reads them with the csv reader.  The wpx module has all of my constants and static data definitions in it.  There are some whitespace issues I still need to work out; for now I brute-force stripped leading and trailing whitespace from values.

def parsesqlcmdoutput():
    """
    Parse output from sqlcmd.

    Returns list of
    collections.namedtuple
    objects.
    """
    lines = []
    with open(wpx.OUTPUTFILE + wpx.CSVEXT, 'r') as f:
        # Get relevant lines.
        # Rip whitespace off end - excessive.
        # XXX - string overloading - hack.
        lines = [linex.strip() for linex in f
                 if linex[0:wpx.STKFLAG[0]] == wpx.STKFLAG[1]]
    rdr = csv.reader(lines, quoting=csv.QUOTE_NONNUMERIC)
    records = []
    for r in rdr:
        # Get rid of whitespace padding
        # around string values.
        for x in xrange(wpx.IHSTRIDX):
            r[x] = r[x].strip()
        records.append(wpx.INPUTRECORD(*r))
    return records

That csv.QUOTE_NONNUMERIC constant is handy.  As per the Python docs, the reader converts anything that isn't quoted to a float.  As long as my data are clean, I should be good there, and it strips out some cruft code-wise.
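A minimal demonstration of that behavior (made-up row, not my real data):

```python
import csv

# One faux-csv line: quoted strings, unquoted floats.
lines = ['"KEY003","hakunamatadacopper",28776.5,1.25']
row = next(csv.reader(lines, quoting=csv.QUOTE_NONNUMERIC))

# Quoted fields stay strings; unquoted fields come back as floats.
assert row == ['KEY003', 'hakunamatadacopper', 28776.5, 1.25]
```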

The list produced by the comprehension is iterable the same way a file object is, so the csv module's reader works fine on it.

That's about it (minus a lot of background code - if you need that, let me know and I'll put it in the comments).

Thanks for stopping by.

 

Sunday, July 10, 2016

Using Generators and Coroutines to Merge Tabular Data (Drill Holes)

I have some mining drill hole data that I need to merge into an old vendor FORTRAN input format.  Basically I do a series of SQL pulls from the drillhole database to csv files, then merge the data.  My methodology has been a bit brute force in matching the separate parts of the drill hole data (lists, opening and closing of files to find matching holes, etc.).  My thought was that I could do this more elegantly and efficiently by iterating through the files with generators.

The ability of generators to communicate with each other via the send() method intrigued me.  I had always been a bit shy about using this language feature.  My csv problem gave me a justification for checking it out.

The reference I used was Dr. Dave Beazley's 2009 PyCon tutorial.  He does a nice job of explaining things as well as dispatching good advice.  (I disobeyed the good advice in the interest of shoehorning coroutines into my solution; I'll cover this below.)  Beazley defines a coroutine, in the context of generators and the "yield" keyword, as a generator where "yield" consumes values (via send()) rather than just producing them.  That is the sense in which I use the word "coroutine" in this post.
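For anyone who hasn't used send() before, a coroutine in this sense is just a generator you prime with next() and then feed values.  A tiny example of the pattern (my own toy, not from the tutorial):

```python
def runningtotal():
    """Coroutine: consumes numbers, yields the running total."""
    total = 0
    while True:
        # yield hands back the current total AND receives the next value.
        value = yield total
        total += value

co = runningtotal()
next(co)            # prime the coroutine; runs it to the first yield
print(co.send(5))   # 5
print(co.send(3))   # 8
```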

Given my problem of a one (drill hole start survey) to many (drill hole interval values) relationship, I attempted a very simple (perhaps oversimplified) toy program demo of what I wanted to do with real data:

def coroutinex(subgenerator):
    """
    Generator function that consumes
    a key value sent from a higher
    level generator.  This generator
    yields two tuples of the form
    (<boolean>, data).  The boolean
    value indicates whether the key
    matches the data.

    Returns a generator.
    """
    while True:
        # One entry point for send()/reset.
        keyx = yield
        subdatatop = next(subgenerator)
        if subdatatop[0] == keyx:
            yield (True, subdatatop)
            for subdataloop in subgenerator:
                if subdataloop[0] == keyx:
                    yield (True, subdataloop)
                else:
                    yield (False, subdataloop)
                    break
 

def toplevelgen(topleveliter, coroutinex):
    """
    Top level generator function.

    subgenerator is a generator
    that this generator sends
    a key value to.  The
    subgenerator yields a two
    tuple that communicates if
    the key matches or not.

    Returns a generator.
    """
    # Get sub generator/coroutine initialized.
    coroutinex.send(None)
    # Variable for dealing with return
    # from sub-generator/coroutine.
    subvalue = False
    for keyx in topleveliter:
        yield keyx
        if subvalue:
            yield subvalue
        subvalue = coroutinex.send(keyx)
        # Get sub generator/coroutine re-initialized
        # after send() reset.
        if subvalue is None:
            # XXX - hack
            subvalue = coroutinex.send(keyx)
        yield subvalue
        for submessage in coroutinex:
            # XXX - another hack to deal with yield of None.
            if not submessage:
                continue
            subvalue = submessage
            # if submessage[0] is True, kick it out.
            if submessage[0]:
                yield submessage
            else:
                # Keep subvalue for after keyvalue
                # yield at top.
                break

topleveliter = range(44, 55)
keysx = [44, 44, 44, 45, 45, 45, 45, 45,
         46, 46, 46, 46, 46, 46, 46, 46,
         47, 47, 47, 48, 48, 48, 48, 48,
         49, 49, 49, 49, 49, 49, 50, 50,
         51, 51, 51, 51, 51, 51, 51, 51,
         52, 52, 52, 52, 52, 52, 52, 52,
         53, 53, 53, 53, 53, 53, 53, 53,
         54, 54, 54, 54, 54, 54, 54, 54]

sequencex = range(1, len(keysx) + 1)
subgenerator = zip(keysx, sequencex)

gensub = coroutinex(subgenerator)
genmain = toplevelgen(topleveliter, gensub)

for x in genmain:
    print(x)



Output:

44
(True, (44, 1))
(True, (44, 2))
(True, (44, 3))
45
(False, (45, 4))
(True, (45, 5))
(True, (45, 6))
(True, (45, 7))
(True, (45, 8))
46
(False, (46, 9))
(True, (46, 10))
(True, (46, 11))
(True, (46, 12))
(True, (46, 13))
(True, (46, 14))
(True, (46, 15))
(True, (46, 16))
47
(False, (47, 17))
(True, (47, 18))
(True, (47, 19))
48
(False, (48, 20))
(True, (48, 21))
(True, (48, 22))
(True, (48, 23))
(True, (48, 24))
49
(False, (49, 25))
(True, (49, 26))
(True, (49, 27))
(True, (49, 28))
(True, (49, 29))
(True, (49, 30))
50
(False, (50, 31))
(True, (50, 32))
51
(False, (51, 33))
(True, (51, 34))
(True, (51, 35))
(True, (51, 36))
(True, (51, 37))
(True, (51, 38))
(True, (51, 39))
(True, (51, 40))
52
(False, (52, 41))
(True, (52, 42))
(True, (52, 43))
(True, (52, 44))
(True, (52, 45))
(True, (52, 46))
(True, (52, 47))
(True, (52, 48))
53
(False, (53, 49))
(True, (53, 50))
(True, (53, 51))
(True, (53, 52))
(True, (53, 53))
(True, (53, 54))
(True, (53, 55))
(True, (53, 56))
54
(False, (54, 57))
(True, (54, 58))
(True, (54, 59))
(True, (54, 60))
(True, (54, 61))
(True, (54, 62))
(True, (54, 63))
(True, (54, 64))


Back to Dr. Beazley's advice: he doesn't recommend this.  Even though "yield" is one keyword, it means two different things in two different contexts, so do not mix generator and coroutine functionality in the same function.  I'm going ahead in this post and doing it anyway.  I don't have an excuse.  It does remind me of some old Bob Dylan lyrics:

Now the rainman gave me two cures
Then he said, "Jump right in"
The one was Texas medicine
The other was just railroad gin
An' like a fool I mixed them
An' it strangled up my mind


It's OK, Bob, some of us just need to learn things the hard way.
 
Onward.

A brief diversion on drill holes - the data for a small-scale (about 2,000 feet or less) geotechnical or geologic drill hole come back in three parts:

1) collar - where the hole starts in space (coordinates).


2) surveys - where the hole ends up going in space relative to the collar (drill pipe has proven to be amazingly flexible when passing through rock).

3) assays - usually the hole is sampled along intervals and chemically or physically analyzed.  The assay intervals may or may not coincide with survey intervals.

Clear as (drilling) mud?  Great - back to Python.

The problem:

Three tabular csv dumps from SQL - a collar file, a survey file, and an assay file.  Each has a unique key in the first column that matches across files (the drill hole key).  On the SQL side I have ensured that there are no orphan key rows in any of the three files and that all three are sorted on the key.
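As an aside, because the files are pre-sorted on the key with no orphans, the one-to-many matching can also be sketched with itertools.groupby.  A toy version with made-up keys standing in for my real drill hole records:

```python
from itertools import groupby
from operator import itemgetter

# Toy stand-ins for the three sorted csv dumps (key in first column).
collars = [(44, 'collar44'), (45, 'collar45')]
surveys = [(44, 's1'), (44, 's2'), (45, 's3')]
assays  = [(44, 'a1'), (45, 'a2'), (45, 'a3')]

surveygroups = groupby(surveys, key=itemgetter(0))
assaygroups = groupby(assays, key=itemgetter(0))

merged = []
for keyx, collar in collars:
    merged.append(collar)
    # No orphan keys, so each collar key has a survey and an assay group.
    _, surveyrows = next(surveygroups)
    merged.extend(surveyrows)
    _, assayrows = next(assaygroups)
    merged.extend(assayrows)
```
Each groupby group must be fully consumed before advancing to the next one; the loop above does that, which is also why all three files have to be sorted on the same key.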

I present the sanitized output here first - it will give some context to the domain-specific parts of the code:

XXXXX,XXXXXX.XXXX,XXXXXXX.XXXX,XXXX.XXXX,0.0000,0.0000,26.4529
XXXXX,0.0000,1.1925,1.1925,283.5688,-13.5310 
XXXXX,1.1925,4.2760,3.0836,284.6224,1.9328      SURVEYS
XXXXX,4.2760,6.3799,2.1039,280.2829,-3.1334       GO
XXXXX,6.3799,9.7024,3.3225,282.5794,2.3632       HERE
XXXXX,9.7024,11.8701,2.1677,285.4406,-1.1631     AFTER
XXXXX,11.8701,13.6920,1.8219,275.9462,-5.0698    COLLAR
XXXXX,13.6920,17.1199,3.4279,285.4561,1.9560    LOCATION
XXXXX,17.1199,19.6944,2.5746,279.2318,-0.7344
XXXXX,19.6944,22.5857,2.8913,282.1947,4.3241
XXXXX,22.5857,24.1879,1.6022,283.8367,-1.7525
XXXXX,24.1879,26.4529,2.2650,287.3820,13.4805
XXXXX                             <----- LEGACY DRILLHOLE NUMBER
XXXXX,X.XXXX,X.XXXX,X.XXXX,X.XX,X.XX,X.XX, etc.
XXXXX,X.XXXX,X.XXXX,X.XXXX,X.XX,X.XX,X.XX, etc.
XXXXX,X.XXXX,X.XXXX,X.XXXX,X.XX,X.XX,X.XX, etc.            ASSAYS
XXXXX,X.XXXX,X.XXXX,X.XXXX,X.XX,X.XX,X.XX, etc.              GO
XXXXX,X.XXXX,XX.XXXX,X.XXXX,X.XX,X.XX,X.XX, etc.            HERE
XXXXX,XX.XXXX,XX.XXXX,X.XXXX,X.XX,X.XX,X.XX,XX.XX, etc.
XXXXX,XX.XXXX,XX.XXXX,X.XXXX,X.XX,X.XX,X.XX,XX.XX, etc.
XXXXX,XX.XXXX,XX.XXXX,X.XXXX,X.XX,X.XX,X.XX,XX.XX, etc.
XXXXX,XX.XXXX,XX.XXXX,X.XXXX,X.XX,X.XX,X.XX,XX.XX, etc.
XXXXX,XX.XXXX,XX.XXXX,X.XXXX,X.XX,X.XX,X.XX,XX.XX, etc.
XXXXX,XX.XXXX,XX.XXXX,X.XXXX,X.XX,X.XX,X.XX,XX.XX, etc.
                                                <----- BLANK LINE
XXXXXX,XXXXXX.XXXX,XXXXXXX.XXXX,XXXX.XXXX,0.0000,0.0000,23.5411
XXXXXX,0.0000,2.5781,2.5781,135.0157,2.3341
XXXXXX,2.5781,5.0351,2.4570,137.1873,5.5353
XXXXXX,5.0351,7.3706,2.3354,135.2276,7.7020
XXXXXX,7.3706,9.9168,2.5462,136.4253,6.4493
                .
                .
                .
                .
                .
                .
                .
               etc.


And the code (sorry about the size - it got messier than I would have hoped):

#!C:\Python35\python

"""
Parse collar, survey, and assay dumps for
trenches from vendor drill hole RDBMS.

Write specially formatted data file for
consumption by old vendor FORTRAN
routine 201.
"""

import csv
from collections import namedtuple
from collections import OrderedDict

COLLAR = './data/collar.csv'
SURVEY = './data/survey.csv'
ASSAYS = './data/assays.csv'
DAT201 = './data/TR.dat'

# collar (ssit) fields
ID = 'drillholeid'
NAME = 'drillholename'
DATE = 'drillholedate'
LEGACY = 'drillholehistoricname'
X = 'collarx'
Y = 'collary'
Z = 'collarz'
AZ = 'azimuth'
DIP = 'dip'
LEN = 'drillholelength'

COLLARFIELDS = [ID, NAME, DATE, LEGACY, X, Y, Z,
                AZ, DIP, LEN]

# survey fields
FROM = 'fromx'
TO = 'depthto'
SAMPLEN = 'surveylength'
AZ = 'azimuth'
DIP = 'dip'

SURVEYFIELDS = [ID, NAME, DATE, LEGACY, FROM, TO,
                SAMPLEN, AZ, DIP]

# assay fields
AFROM = 'assayfrom'
ATO = 'assayto'
AI = 'assayinterval'
ASSAY1 = 'assay1'
ASSAY2 = 'assay2'
ASSAY3 = 'assay3'
ASSAY4 = 'assay4'
ASSAY5 = 'assay5'
ASSAY6 = 'assay6'
ASSAY7 = 'assay7'
ASSAY8 = 'assay8'

ASSAYFIELDS = [ID, NAME, LEGACY, AFROM, ATO, AI, ASSAY1,
               ASSAY2, ASSAY3, ASSAY4, ASSAY5, ASSAY6, ASSAY7, ASSAY8]

ASSAYFORMAT = '.2f'
SURVEYFORMAT = '.4f'

COMMA = ','

# Output for 201 file format.
# Collars.
COLOUTPUTCOLS = [X, Y, Z, AZ, DIP, LEN]
COLFMTOUTPUT = [(attribx, SURVEYFORMAT) for attribx in COLOUTPUTCOLS]
# Surveys.
SURVOUTPUTCOLS = [FROM, TO, SAMPLEN, AZ, DIP]
SURVFMTOUTPUT = [(attribx, SURVEYFORMAT) for attribx in SURVOUTPUTCOLS]
# Assays.
ASSYOUTPUTCOLS = [AFROM, ATO, AI, ASSAY1, ASSAY2, ASSAY3, ASSAY4, ASSAY5,
                  ASSAY6, ASSAY7, ASSAY8]
ASSYOUTPUTFMTS = 3 * [SURVEYFORMAT] + 8 * [ASSAYFORMAT]
# Have to use this repeatedly - hence list.
ASSYFMTOUTPUT = list(zip(ASSYOUTPUTCOLS, ASSYOUTPUTFMTS))

RETCHAR = '\n'

# For tracking which dataset we're
# dealing with.
SURVEYSUBDATA = 'survey'
ASSAYSUBDATA = 'assay'

# For survey/assay dictionary.
COR = 'coroutine'
FMT = 'format'
LAST = 'lastvalue'
END = 'end'

INFOMESSAGE = 'Now doing hole number {0} . . .'

def makecsvdatagenerator(csvrdr, ntname, ntfields):
    """
    Returns a generator that yields csv
    row records as named tuple objects.

    csvrdr is the csv.reader object.

    ntname is the name given to the
    collections.namedtuple object.

    ntfields is the list of field names
    for the collections.namedtuple object.
    """
    namedtup = namedtuple(ntname, ntfields)
    return (namedtup(*linex) for linex in csvrdr)

def formatassay(numstring, formatx):
    """
    Returns a string representing a float
    that typically is in 0.00 format, but
    other float formats can be applied.

    numstring is a string representing a float.

    formatx is the desired format (Python 3 format string).
    """
    return format(float(numstring), formatx)

def getnumericstrings(record, formats):
    """
    Returns list of strings.

    record is a collections.namedtuple instance.

    formats is a list of two-tuples of namedtuple
    attributes and numeric string formats to be
    applied to each attribute's value.
    """
    return [formatassay(getattr(record, pairx[0]), pairx[1])
            for pairx in formats]

def coroutinex(subgenerator):
    """
    Generator function.
 
    Consumes key value and yields
    two tuple of (<boolean>,
    next(subgenerator)) in response.
    boolean value indicates
    whether key matches first
    value of subgenerator namedtuple.

    subgenerator is a generator of
    namedtuples.

    Returns a generator.
    """
    while True:
        keyx = yield
        subdatatop = next(subgenerator)
        if subdatatop.drillholeid == keyx:
            yield (True, subdatatop)
            for subdataloop in subgenerator:
                if subdataloop.drillholeid == keyx:
                    yield (True, subdataloop)
                else:
                    yield (False, subdataloop)
                    break
        # Case where only one interval in
        # drill hole.
        else:
            yield (False, subdatatop)

def formatdataline(record, formats):
    """
    Prepare record as a line
    of text for write to file.

    record is a collections.namedtuple
    object.

    formats is a list of two tuples of
    namedtuple attributes and numeric
    string formats.

    Returns string.
    """
    recordline = [record.drillholehistoricname]
    recordline.extend(getnumericstrings(record,
                                        formats))
    return COMMA.join(recordline) + RETCHAR

def dealwithsend(subgen, sendval):
    """
    Helper function to clean up code.
    Deals with initial receipt of
    None value upon send() and
    re-sends value.

    Sends value sendval to
    generator/coroutine subgen.

    Returns two tuple of (<boolean>,
    <collections.namedtuple>).
    """
    retval = subgen.send(sendval)
    if retval is None:
        retval = subgen.send(sendval)
    return retval

def dealwithyieldrecord(survassay, subdata):
    """
    Helper function to clean up code.

    Formats values for write to file.

    survassay is a dictionary of values.

    subdata is the dictionary key that
    tells which data is being handled
    (survey or assay).
    """
    return formatdataline(survassay[subdata][LAST][1],
                          survassay[subdata][FMT])

def cyclecollars(collargen,
                 survassay):
    """
    Generator function that yields
    data (strings) for write to a
    a specially formatted drill hole
    file.

    This is the top level generator
    for working the merging of
    drillhole data (collars, surveys,
    assays).

    survassay is a collections.OrderedDict
    object that references the respective
    survey and assay generators and holds
    information for tracking which subset
    of data (surveys or assays) are being
    worked.
    """
    for record in collargen:
        keyx = record.drillholeid
        label = record.drillholehistoricname
        survassay[SURVEYSUBDATA][END] = label + RETCHAR
        print(INFOMESSAGE.format(label))
        yield formatdataline(record, COLFMTOUTPUT)
        for subdata in survassay:
            fmt = survassay[subdata][FMT]
            if survassay[subdata][LAST]:
                yield dealwithyieldrecord(survassay, subdata)
            subvalue = dealwithsend(survassay[subdata][COR], keyx)
            # Case where only one interval.
            if not subvalue[0]:
                survassay[subdata][LAST] = subvalue
                yield survassay[subdata][END]
                continue
            yield formatdataline(subvalue[1], fmt)
            for submessage in survassay[subdata][COR]:
                # End of iteration.
                if submessage is None:
                    yield survassay[subdata][END]
                    break
                if submessage[0]:
                    yield formatdataline(submessage[1], fmt)
                else:
                    survassay[subdata][LAST] = submessage
                    yield survassay[subdata][END]
                    break

def main():
    """
    Parse csv dumps from SQL and write
    drillhole data fields for import
    to old vendor FORTRAN based binary
    files.

    Side effect function.
    """
    with open(COLLAR, 'r') as colx:
        colcsv = csv.reader(colx)
        collargen = makecsvdatagenerator(colcsv,
                                         'collars',
                                         COLLARFIELDS)
        with open(SURVEY, 'r') as svgx:
            survcsv = csv.reader(svgx)
            survgen = makecsvdatagenerator(survcsv,
                                           'surveys',
                                           SURVEYFIELDS)
            surveycoroutinex = coroutinex(survgen)
            with open(ASSAYS, 'r') as assx:
                assycsv = csv.reader(assx)
                assygen = makecsvdatagenerator(assycsv,
                                               'assays',
                                               ASSAYFIELDS)
                assaycoroutinex = coroutinex(assygen)
                with open(DAT201, 'w') as d201:
                    # Get sub generators/coroutines initialized.
                    surveycoroutinex.send(None)
                    assaycoroutinex.send(None)
                    surveyassay = OrderedDict()
                    surveyassay[SURVEYSUBDATA] = {COR:surveycoroutinex,
                                                  FMT:SURVFMTOUTPUT,
                                                  LAST:None,
                                                  END:None}
                    surveyassay[ASSAYSUBDATA] = {COR:assaycoroutinex,
                                                 FMT:ASSYFMTOUTPUT,
                                                 LAST:None,
                                                 END:RETCHAR}
                    colgenx = cyclecollars(collargen,
                                           surveyassay)
                    for linex in colgenx:
                        d201.write(linex)
    print('Done')

if __name__ == '__main__':
    main()
 


The bad news: this was more difficult with a real-world dataset than I anticipated.  Beazley's admonition was an apt one.

The good news: it does perform better than my previous brute-force implementations.  From the standpoint of iterating through the datasets once and not wasting resources (even allowing for whatever machinery facilitates the generator communication closer to the metal), this is a better implementation.  Also, I learned a bit more about the "yield" keyword.

Thanks for stopping by. 




 

Monday, April 18, 2016

7-Zip-JBinding API with jython on Windows

I have a set of multi-GB Windows folders that I need to archive in 7-zip format each month.  I'd prefer not to use the mouse to compress the folders "manually."  Also, I didn't want to shell out to the command line via the subprocess module as I have with some other programs.  Ideally, I wanted to control 7-Zip programmatically.  The 7-Zip-JBinding libraries offered a means to do this from jython.

7-Zip-JBinding is written around Java interfaces that are structured pretty specifically, so I did not venture too far from the examples given in the 7-Zip-JBinding documentation.  I smithed two classes for my own purposes, compressing and decompressing, and present them (Java code) below.  The decompression one has a separate method for retrieving the paths of the compressed files.  This is not efficient, but given what I need to do and the limitations of the library and the approach, it works out for the best.

import java.io.IOException;
import java.io.RandomAccessFile;

import net.sf.sevenzipjbinding.IOutCreateArchive7z;
import net.sf.sevenzipjbinding.IOutCreateCallback;
import net.sf.sevenzipjbinding.IOutItem7z;
import net.sf.sevenzipjbinding.ISequentialInStream;
import net.sf.sevenzipjbinding.SevenZip;
import net.sf.sevenzipjbinding.SevenZipException;
import net.sf.sevenzipjbinding.impl.OutItemFactory;
import net.sf.sevenzipjbinding.impl.RandomAccessFileOutStream;
import net.sf.sevenzipjbinding.util.ByteArrayStream;


/* Off StackOverflow - works for getting
 * file content/bytes from path */
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.Path;


public class SevenZipThing {

    private static final String RETCHAR = "\n";
    private static final String INTFMT = "%,d";
    private static final String BYTESTOCOMPRESS = " bytes total to compress\n";
    private static final String ERROCCURS = "Error occurs: ";
    private static final String COMPRESSFILE = "\nCompressing file ";
    private static final String RW = "rw";
    private static final int LVL = 5;
    private static final String SEVZERR = "7z-Error occurs:";
    private static final String ERRCLOSING = "Error closing archive: ";
    private static final String ERRCLOSINGFLE = "Error closing file: ";
    private static final String SUCCESS = "\nCompression operation succeeded\n";


    private String filename;
    /* String[] array conversion from jython list
     * implicit and poses no problems (JKD7) */
    private String[] pathsx;


    public SevenZipThing(String filename, String[] pathsx) {
        this.filename = filename;
        this.pathsx = pathsx;
    }


    /**
     * The callback provides information about archive items.
     */
    /*
     * I copied this straight from the sevenZipJBinding author's
     * code - but I haven't put much in to deal with messaging
     * or error handling.
     */
    private final class MyCreateCallback
            implements IOutCreateCallback<IOutItem7z> {


        public void setOperationResult(boolean operationResultOk)
                throws SevenZipException {
            // Track each operation result here
        }


        public void setTotal(long total) throws SevenZipException {
            // Track operation progress here
    
            System.out.print(RETCHAR + String.format(INTFMT, total) +
                     BYTESTOCOMPRESS);
        }


        public void setCompleted(long complete) throws SevenZipException {
            // Track operation progress here
        }


        public IOutItem7z getItemInformation(int index,
                OutItemFactory<IOutItem7z> outItemFactory) {

            IOutItem7z item = outItemFactory.createOutItem();
            Path path = Paths.get(pathsx[index]);
            item.setPropertyPath(pathsx[index]);
            try {
                // Java arrays are limited to 2 ** 31 items - small.
                byte[] data = Files.readAllBytes(path);
                item.setDataSize((long) data.length);
                return item;
            // XXX - I could do a lot better than this (error handling).
            } catch (Exception e) {
                System.err.println(ERROCCURS + e);
            }
            return null;
        }


        public ISequentialInStream getStream(int i)
            throws SevenZipException {

            Path path = Paths.get(pathsx[i]);
            try {
                byte[] data = Files.readAllBytes(path);
                System.out.println(COMPRESSFILE + path);
                return new ByteArrayStream(data, true);
            } catch (Exception e) {
                System.err.println(ERROCCURS + e);
            }
            return null;
        }
    }


    public void compress() {
       
        /* Mostly copied from sevenZipJBinding's author's code -
         * I made the compress method public to work from jython.
         * Also, I deal with all of the file listing in jython
         * and just pass a list to this class. */

        boolean success = false;
        RandomAccessFile raf = null;
        IOutCreateArchive7z outArchive = null;
        try {
            raf = new RandomAccessFile(filename, RW);

            // Open out-archive object
            outArchive = SevenZip.openOutArchive7z();

            // Configure archive
            outArchive.setLevel(LVL);
            outArchive.setSolid(true);

            // All available processors.
            outArchive.setThreadCount(0);

            // Create archive
            outArchive.createArchive(new RandomAccessFileOutStream(raf),
                    pathsx.length, new MyCreateCallback());
            success = true;
        } catch (SevenZipException e) {
            System.err.println(SEVZERR);
            // Get more information using extended method
            e.printStackTraceExtended();
        } catch (Exception e) {
            System.err.println(ERROCCURS + e);
        } finally {
            if (outArchive != null) {
                try {
                    outArchive.close();
                } catch (IOException e) {
                    System.err.println(ERRCLOSING + e);
                    success = false;
                }
            }
            if (raf != null) {
                try {
                    raf.close();
                } catch (IOException e) {
                    System.err.println(ERRCLOSINGFLE + e);
                    success = false;
                }
            }
        }
        if (success) {
            System.out.println(SUCCESS);
        }
    }
}


import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.File;
import java.io.OutputStream;
import java.io.FileOutputStream;
import java.io.FileNotFoundException;

import java.util.Arrays;
import java.util.ArrayList;

import net.sf.sevenzipjbinding.IInArchive;
import net.sf.sevenzipjbinding.PropID;
import net.sf.sevenzipjbinding.SevenZip;
import net.sf.sevenzipjbinding.SevenZipException;
import net.sf.sevenzipjbinding.impl.RandomAccessFileInStream;
import net.sf.sevenzipjbinding.IArchiveExtractCallback;
import net.sf.sevenzipjbinding.ExtractOperationResult;
import net.sf.sevenzipjbinding.ExtractAskMode;
import net.sf.sevenzipjbinding.ISequentialOutStream;

/* 7z archive format */
/* SEVEN_ZIP is the one I want */
import net.sf.sevenzipjbinding.ArchiveFormat;


public class SevenZipThingExtract {
    private String filename;
    private String extractdirectory;
    private ArrayList<String> foldersx = null;
    private boolean subdirectory = false;

    private static final String ERROPENINGFLE = "Error opening file: ";
    private static final String ERRWRITINGFLE = "Error writing to file: ";
    private static final String EXTERR = "Extraction error";
    private static final String INFOFMT = "%9X | %10s | %s";
    private static final String RETCHAR = "\n";
    private static final String INTFMT = "%,d";
    private static final String BYTESTOEXTRACT = " bytes total to extract\n";
    private static final String RW = "rw";
    private static final String BACKSLASH = "\\";
    private static final String SEVZERR = "7z-Error occurs:";
    private static final String ERROCCURS = "Error occurs: ";
    private static final String ERRCLOSING = "Error closing archive: ";
    private static final String ERRCLOSINGFLE = "Error closing file: ";


    public SevenZipThingExtract(String filename, String extractdirectory,
                                boolean subdirectory) {
        this.filename = filename;
        this.foldersx = new ArrayList<String>();
        this.extractdirectory = extractdirectory;
        this.subdirectory = subdirectory;
    }


    private final class MyExtractCallback
            implements IArchiveExtractCallback {

        // Copied mostly from example.
        private int hash = 0;
        private int size = 0;
        private int index;
        private boolean skipExtraction;
        private IInArchive inArchive;

        private OutputStream outputStream;
        private File file;


        public MyExtractCallback(IInArchive inArchive) {
            this.inArchive = inArchive;
        }


        @Override
        public ISequentialOutStream getStream(int index,
                          ExtractAskMode extractAskMode)
                throws SevenZipException {


             this.index = index;
             // I'm not skipping anything.
             skipExtraction = false;

             String path = (String) inArchive.getProperty(index, PropID.PATH);
             // Try prepending extractdirectory.
             if (subdirectory) {
                 path = extractdirectory + BACKSLASH + path.substring(2);
             } else {
                 path = extractdirectory + BACKSLASH + path;
             }
             file = new File(path);

            try {
                outputStream = new FileOutputStream(file);
            } catch (FileNotFoundException e) {
                throw new SevenZipException(ERROPENINGFLE
                        + file.getAbsolutePath(), e);
            }
            return new ISequentialOutStream() {
                public int write(byte[] data) throws SevenZipException {
                   try {
                       outputStream.write(data);
                   } catch (IOException e) {
                       throw new SevenZipException(ERRWRITINGFLE
                               + file.getAbsolutePath(), e);
                   }
                   // Update running hash and size so the
                   // per-entry report has real numbers.
                   hash ^= Arrays.hashCode(data);
                   size += data.length;
                   return data.length; // Return amount of consumed data
                }
            };
       }


        public void prepareOperation(ExtractAskMode extractAskMode)
                throws SevenZipException {
        }

        public void setOperationResult(ExtractOperationResult extractOperationResult)
                throws SevenZipException {
            // Close the per-entry output stream, then track the result.
            if (outputStream != null) {
                try {
                    outputStream.close();
                    outputStream = null;
                } catch (IOException e) {
                    throw new SevenZipException(ERRCLOSINGFLE
                            + file.getAbsolutePath(), e);
                }
            }
            if (extractOperationResult != ExtractOperationResult.OK) {
                System.err.println(EXTERR);
            } else {
                System.out.println(String.format(INFOFMT, hash, size,
                        inArchive.getProperty(index, PropID.PATH)));
                hash = 0;
                size = 0;
            }
        }


        public void setTotal(long total) throws SevenZipException {
            System.out.print(RETCHAR + String.format(INTFMT, total) +
                             BYTESTOEXTRACT);
        }


        public void setCompleted(long complete) throws SevenZipException {
            // Track operation progress here
        }
    }


    private final class MyGetPathsCallback
            implements IArchiveExtractCallback {

        // Copied mostly from example.
        private int hash = 0;
        private int size = 0;
        private int index;
        private boolean skipExtraction;
        private IInArchive inArchive;

        public MyGetPathsCallback(IInArchive inArchive) {
            this.inArchive = inArchive;
        }

        public ISequentialOutStream getStream(int index,
            ExtractAskMode extractAskMode)
                throws SevenZipException {
             this.index = index;
             // I'm not skipping anything.
             skipExtraction = false;

             String path = (String) inArchive.getProperty(index,
                 PropID.PATH);
             foldersx.add(path);

             return new ISequentialOutStream() {
                public int write(byte[] data) throws SevenZipException {
                    hash ^= Arrays.hashCode(data);
                    size += data.length;
                    // Return amount of processed data
                    return data.length;
                }
            };
        }


        public void prepareOperation(ExtractAskMode extractAskMode)
                throws SevenZipException {
        }


        public void setOperationResult(ExtractOperationResult extractOperationResult)
                throws SevenZipException {
            // Track each operation result here
            if (extractOperationResult != ExtractOperationResult.OK) {
                System.err.println(EXTERR);
            } else {
                System.out.println(String.format(INFOFMT, hash, size,
                        inArchive.getProperty(index, PropID.PATH)));
                hash = 0;
                size = 0;
            }
        }


        public void setTotal(long total) throws SevenZipException {
            System.out.print(RETCHAR + String.format(INTFMT, total) +
                BYTESTOEXTRACT);
        }


        public void setCompleted(long complete) throws SevenZipException {
            // Track operation progress here
        }
    }


    public void extractfiles() {
       
        RandomAccessFile raf = null;
        IInArchive inArchive = null;
        try {
            raf = new RandomAccessFile(filename, RW);

            inArchive = SevenZip.openInArchive(ArchiveFormat.SEVEN_ZIP,
                    new RandomAccessFileInStream(raf));

            int itemCount = inArchive.getNumberOfItems();
           
            // From StackOverflow - could use IntStream,
            // but that's Java 1.8 (using 1.7).
            int[] fileindices = new int[itemCount];
            for(int k = 0; k < fileindices.length; k++)
                fileindices[k] = k;
            inArchive.extract(fileindices, false,
                new MyExtractCallback(inArchive));
        } catch (SevenZipException e) {
            System.err.println(SEVZERR);
            // Get more information using extended method
            e.printStackTraceExtended();
        } catch (Exception e) {
            System.err.println(ERROCCURS + e);
        } finally {
            if (inArchive != null) {
                try {
                    inArchive.close();
                } catch (IOException e) {
                    System.err.println(ERRCLOSING + e);
                }
            }
            if (raf != null) {
                try {
                    raf.close();
                } catch (IOException e) {
                    System.err.println(ERRCLOSINGFLE + e);
                }
            }
        }
    }


    public ArrayList<String> getfolders() {
       
        RandomAccessFile raf = null;
        IInArchive inArchive = null;

        try {
            raf = new RandomAccessFile(filename, RW);

            inArchive = SevenZip.openInArchive(ArchiveFormat.SEVEN_ZIP,
                    new RandomAccessFileInStream(raf));

            int itemCount = inArchive.getNumberOfItems();
           
            // From StackOverflow - could use IntStream,
            // but that's Java 1.8 (using 1.7).
            int[] fileindices = new int[itemCount];
            for(int k = 0; k < fileindices.length; k++)
                fileindices[k] = k;
            inArchive.extract(fileindices, false,
                new MyGetPathsCallback(inArchive));
        } catch (SevenZipException e) {
            System.err.println(SEVZERR);
            // Get more information using extended method
            e.printStackTraceExtended();
        } catch (Exception e) {
            System.err.println(ERROCCURS + e);
        } finally {
            if (inArchive != null) {
                try {
                    inArchive.close();
                } catch (IOException e) {
                    System.err.println(ERRCLOSING + e);
                }
            }
            if (raf != null) {
                try {
                    raf.close();
                } catch (IOException e) {
                    System.err.println(ERRCLOSINGFLE + e);
                }
            }
        }
        return foldersx;
    }
}


The getfolders method in the SevenZipThingExtract class is the extra method that gets the list of folders.  As noted in the jython code below, the limitations on the number of bytes and files that can be compressed necessitate splitting larger files into chunks.  Also, for my specific use case, I need to extract files to a specific folder and set of subfolders.  My methodology is outlined in the comments in the jython code.  The good news:  if I get run over by a bus and the decompression part of the program gets lost, people will be able to get the files back with some effort.  The bad news:  they will be cursing my headstone.  You do the best you can.
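Conceptually, getfolders/extractfiles is a list-then-extract pattern.  A quick sketch with Python's stdlib zipfile as a stand-in for sevenzipjbinding (an analogy only - the Java API calls are different):

```python
import os
import tempfile
import zipfile

def list_then_extract(archivepath, destdir):
    """List the entry paths first, then extract everything -
    mirrors the getfolders()/extractfiles() split."""
    with zipfile.ZipFile(archivepath) as archive:
        paths = archive.namelist()     # like getfolders()
        archive.extractall(destdir)    # like extractfiles()
    return paths

# Round-trip a tiny archive in a scratch directory.
workdir = tempfile.mkdtemp()
archivepath = os.path.join(workdir, 'sample.zip')
with zipfile.ZipFile(archivepath, 'w') as archive:
    archive.writestr('sub/data.txt', 'hello')
entries = list_then_extract(archivepath, workdir)
```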

The three jython modules follow.  The first one, folderstozip.py, is just constants:


#!java -jar C:\jython-2.7.0\jython.jar

# folderstozip.py

"""
Constants used in compression and
decompression.
"""


FRONTSLASH = '/'
BACKSLASH = '\\'
EMPTY = ''
SAMEFOLDER = './'
SAMEFOLDERWIN = u'.\\'

SPLITFILETRACKER = 'SPLITFILETRACKER.csv'
SPLITFILE = '{0:s}.{1:s}'
UCOMMA = u','

# 3rd party sevenZipJBindings library.
PATH7ZJB = 'C:/MSPROJECTS/EOMReconciliation/2016/03March'
PATH7ZJB += '/Backup/sevenzipjbinding/lib/sevenzipjbinding.jar'


# OS specific 3rd party sevenZipJBindings library.
PATH7ZJBOSSPEC = r'C:/MSPROJECTS/EOMReconciliation/2016/03March'
PATH7ZJBOSSPEC += '/Backup/sevenzipjbinding/lib/sevenzipjbinding-Windows-amd64.jar'


PROGFOLDER = 'C:/MSPROJECTS/EOMReconciliation/2016/03March/Backup'
PROGFOLDER += FRONTSLASH


# Informational messages.
WROTEFILE = 'Wrote file {:s}\n'
SPLITFILEMSG = 'Have now split {0:,d} bytes of file {1:s} into {2:d} chunks of {3:,d} bytes.\n'
DONESPLITTING = '\nDone splitting file'
FILESAFTERSPLIT = '\n{:d} files after split'

COMPRESSING = '\nCompressing file {:s} . . .\n'
DELETING = '\nDeleting file {:s} . . .\n'
DELETINGDIR = '\nNow deleting {:s} . . .\n'


# Zero-padded ids - room for 100,000 file names.
UNIQUEX = '{0:05d}'


# XXX - multiple file archives limited to
#       10KB - reason unknown - crashes jvm
#       with IInStream interface class not
#       found.
# XXX - choked on 8700 bytes - try dropping
#       this from 9500 to 8500.
MULTFILELIMIT = 8500
HALFLIMIT = MULTFILELIMIT/2

# About 50 splits for a 3GB file.
CHUNK = 2 ** 26


# Path plus split number.
FILEN = r'{0:s}.{1:03d}'

# Path plus basefilename.
FILEB = r'{0:s}{1:s}'


# Read/Write constants.
RB = 'rb'
WB = 'wb'
W = 'w'


# Filename plus split number.
ARCHIVEX = '{0:s}/{1:s}.7z'


# multifile archive
MULTARCHIVEX = '{0:s}/archive{1:03d}.7z'
MULTFILES = '. . . multiple files'


# File categories.
# Size less than HALFLIMIT.
SMALL = 'small'
# Size greater than or equal to HALFLIMIT but
# less than or equal to CHUNK.
MEDIUM = 'medium'
# Larger than CHUNK.
LARGE = 'large'


BASEPATH = 'basepath'

FILES = 'files'


# XXX - this folder has recognizable
#       folder names within your domain
#       space - mine are open pit mining
#       area names.
BASEDIRS = ['Pit-1', 'Pit-2', 'Pit-3']
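Before moving on, the small/medium/large split these constants drive can be sketched in a few lines (plain Python 3 here rather than jython, with made-up file sizes):

```python
MULTFILELIMIT = 8500
HALFLIMIT = MULTFILELIMIT // 2
CHUNK = 2 ** 26

def segregate(sized_files):
    """Bucket (size, name) two tuples into the three
    categories above, each sorted by size."""
    return {
        'small': sorted(x for x in sized_files if x[0] < HALFLIMIT),
        'medium': sorted(x for x in sized_files
                         if HALFLIMIT <= x[0] <= CHUNK),
        'large': sorted(x for x in sized_files if x[0] > CHUNK),
    }

# Hypothetical file sizes in bytes.
files = [(120, 'a.msr'), (9000, 'b.msr'), (2 ** 27, 'c.msr'), (3000, 'd.msr')]
buckets = segregate(files)
```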


#!java -jar C:/jython-2.7.0/jython.jar

# sevenzipper.py

"""
Use java 3rd party 7-zip compression
library (sevenZipJBindings) from
jython to 7zip up MineSight project
files.
"""


import folderstozip as fld
# Need to adjust path to get necessary jar imports.
import sys

# Need for os.path
import os


# Original path of file plus split number.
SPLITFILERECORD = '{0:s},{1:03d}'


sys.path.append(fld.PATH7ZJB)
sys.path.append(fld.PATH7ZJBOSSPEC)


# java 7zip library
import SevenZipThing as z7thing


# For copying files to program
# directory and deleting the old
# ones where necessary.
import shutil

# For unique archive names.
import itertools


COUNTERX = itertools.count(0, 1)


def splitfile(originalfilepath, splitfilestrackerfile):
    """
    Split file at (string) originalfilepath
    into fld.CHUNK sized chunks and indicate
    sequence by number in new split file
    name.

    Return generator of relative file paths
    inside project folder.

    originalfilepath is the path of the
    file that needs to be split into parts.

    splitfilestrackerfile is an open file
    object used for tracking file splits
    for later retrieval.
    """
    sizeoffile = os.path.getsize(originalfilepath)
    # Round up, but avoid writing an extra empty chunk when
    # the size is an exact multiple of fld.CHUNK.
    chunks = (sizeoffile + fld.CHUNK - 1)/fld.CHUNK
    # Counter.
    i = 1
    with open(originalfilepath, fld.RB) as f:
        while i < chunks + 1:
            with open(fld.FILEN.format(originalfilepath, i), fld.WB) as f2:
                f2.write(f.read(fld.CHUNK))
                print(fld.WROTEFILE.format(fld.FILEN.format(originalfilepath, i)))
                print(fld.SPLITFILEMSG.format(f.tell(), originalfilepath, i, fld.CHUNK))
                print >> splitfilestrackerfile, (SPLITFILERECORD.format(originalfilepath, i))
                i += 1
    print(fld.DONESPLITTING)
    print(fld.FILESAFTERSPLIT.format(i - 1))
    return (fld.FILEN.format(originalfilepath, x) for x in xrange(1, i))


def movefiles(movefilesx, intermediatepath):
    """
    Move files from MineSight project directory
    to program directory.

    Return a list of base file names for the
    moved files.

    movefilesx is a generator of file paths.
    intermediatepath is a string relative path
    between the program folder and the sub-folder
    of the MineSight directory (_msresources/06SOLIDS,
    for example).
    """
    # Move files to that folder.
    movedfiles = []
    for pathx in movefilesx:
        shutil.move(pathx, fld.PROGFOLDER + intermediatepath +
                    os.path.basename(pathx))
        movedfiles.append(intermediatepath + os.path.basename(pathx))
    return movedfiles


def copyfiles(copyfilesx, intermediatepath):
    """
    Copy files from MineSight project directory
    to program directory.

    Return a list of base file names for the
    copied files.

    copyfilesx is a generator of file paths.
    intermediatepath is a string relative path
    between the program folder and the sub-folder
    of the MineSight directory (_msresources/06SOLIDS,
    for example).
    """
    # Copy files to that folder.
    copiedfiles = []
    for pathx in copyfilesx:
        shutil.copyfile(pathx, fld.PROGFOLDER + intermediatepath +
                        os.path.basename(pathx))
        copiedfiles.append(intermediatepath + os.path.basename(pathx))
    return copiedfiles


def compressfilessingle(filestocompress, prefix, basedir):
    """
    Compresses files into an archive.

    This is for larger files that take up
    an entire archive (7z file).

    filestocompress is a list of paths of
    files to be compressed.  These files
    reside inside the program directory.

    prefix is a string path addition, usually
    './' that allows the function to deal
    with relative paths for files that reside
    in subfolders.

    basedir is the name of the main MineSight
    project directory (Fwaulu, for example).

    Side effect function.
    """
    for pathx in filestocompress:
        basename = os.path.split(pathx)[1]
        # Need unique name for subfolder files with same names.
        uniqueid = fld.UNIQUEX.format(COUNTERX.next())
        uniquename = uniqueid + basename
        print(fld.COMPRESSING.format(prefix + basename))
        archx = z7thing(fld.ARCHIVEX.format(basedir, uniquename),
                        [prefix + basename])
        archx.compress()


def compressfilesmultiple(filestocompress, indexx, basedir):
    """
    Compresses files into an archive.

    filestocompress is a list of paths of
    files to be compressed.  These files
    reside inside the program directory.

    indexx is an integer that gives the
    archive a unique name.

    basedir is the name of the main MineSight
    project directory (Fwaulu, for example).

    Side effect function.
    """
    print(fld.COMPRESSING.format(fld.MULTFILES))
    archx = z7thing(fld.MULTARCHIVEX.format(basedir, indexx),
                                            filestocompress)
    archx.compress()


def segregatefiles(directoryx, basefiles):
    """
    From a string directory path directoryx
    and a list of base file names, returns
    a dictionary of lists of files and their
    sizes sorted on size and keyed on file
    category.
    """
    retval = {}
    # Add separator to end of directory path.
    directoryx += fld.FRONTSLASH
    # Get all files in folder and their sizes.
    allfiles = [(os.path.getsize(fld.FILEB.format(directoryx, filex)), filex)
                 for filex in basefiles]
    retval[fld.SMALL] = [x for x in allfiles if x[0] < fld.HALFLIMIT]
    retval[fld.SMALL].sort()
    retval[fld.MEDIUM] = [x for x in allfiles if x[0] >= fld.HALFLIMIT and
                          x[0] <= fld.CHUNK]
    retval[fld.MEDIUM].sort()
    retval[fld.LARGE] = [x for x in allfiles if x[0] > fld.CHUNK]
    retval[fld.LARGE].sort()
    return retval


def deletefiles(movedfiles):
    """
    Delete files that have been compressed.

    movedfiles is a list of paths of
    files that have been moved or copied to
    the program directory for compression.

    Side effect function.
    """
    for pathx in movedfiles:
        print(fld.DELETING.format(pathx))
        os.remove(pathx)


def getsmallfilegroupings(smallfiles):
    """
    Generator function that yields
    a list of files whose sum is
    less than the program's limit
    for bytes to be archived in a
    multiple file archive.

    smallfiles is a list of two tuples
    of (filesize in bytes, file path).
    """
    lenx = len(smallfiles)
    insidecounter1 = 0
    insidecounter2 = 1
    while (insidecounter2 < (lenx + 1)):
        sumx = sum(x[0] for x in smallfiles[insidecounter1:insidecounter2])
        if sumx > fld.MULTFILELIMIT:
            # Back up one.
            insidecounter2 -= 1
            yield (x[1] for x in smallfiles[insidecounter1:insidecounter2])
            # Advance counters - the next group starts at the
            # file that pushed the sum over the limit.
            insidecounter1 = insidecounter2
            insidecounter2 = insidecounter1 + 1
        else:
            insidecounter2 += 1
    # Yield the final partial group.
    if insidecounter1 < lenx:
        yield (x[1] for x in smallfiles[insidecounter1:lenx])


def compresslargefiles(largefiles, dirx, prefix, basedir, splitfilestrackerfile):
    """
    Deal with compression of files that need to
    be split prior to compression.

    largefiles is a list of two tuples of file
    sizes and names.

    dirx is the directory (str) in which the files
    are located.

    prefix is a string prefix to augment path
    identification for compression.

    basedir is the name of the main MineSight
    project directory (Fwaulu, for example).

    splitfilestrackerfile is an open file
    object used for tracking file splits
    for later retrieval.

    Side effect function.
    """
    for filex in largefiles:
        # Get generator of paths of splits.
        splitfiles = splitfile(fld.FILEB.format(dirx, filex[1]),
                               splitfilestrackerfile)
        movedfiles = movefiles(splitfiles, prefix)
        compressfilessingle(movedfiles, prefix, basedir)
        deletefiles(movedfiles)

def compressmediumfiles(mediumfiles, dirx, prefix, basedir):
    """
    Deal with compression of files that need to
    be compressed each to its own archive.

    mediumfiles is a list of two tuples of file
    sizes and paths.

    dirx is the directory (str) in which the files
    are located.

    prefix is a string prefix to augment path
    identification for compression.

    basedir is the name of the main MineSight
    project directory (Fwaulu, for example).

    Side effect function.
    """
    filestocopy = (dirx + x[1] for x in mediumfiles)
    copiedfiles = copyfiles(filestocopy, prefix)
    compressfilessingle(copiedfiles, prefix, basedir)
    deletefiles(copiedfiles)

def compresssmallfiles(smallfiles, dirx, prefix, indexx, basedir):
    """
    Deal with compression of files that can be
    compressed in groups.

    smallfiles is a list of two tuples of file
    sizes and paths.

    dirx is the directory (str) in which the files
    are located.

    prefix is a string prefix to augment path
    identification for compression.

    indexx is the current index that the 7zip
    file counter (ensures unique archive name)
    is on.

    basedir is the name of the main MineSight
    project directory (Fwaulu, for example).

    Returns integer for current archive counter
    index.
    """
    smallgroupings = getsmallfilegroupings(smallfiles)
    while True:
        try:
            grouplittlefiles = smallgroupings.next()
            littlefiles = (dirx + x for x in grouplittlefiles)
            copiedfiles = copyfiles(littlefiles, prefix)
            compressfilesmultiple(copiedfiles, indexx, basedir)
            indexx += 1
            deletefiles(copiedfiles)
        except StopIteration:
            break
    return indexx


# XXX - hack
def matchbasedir(folderlist):
    """
    Get MineSight project folder name
    that matches a folder in the path
    in question.

    folderlist is a list (in order)
    of directories in a path.

    Returns string.
    """
    for folderx in folderlist:
        for projx in fld.BASEDIRS:
            if projx == folderx:
                return folderx
    return None


def getbasedir(pathx):
    """
    Returns two tuple of strings for
    basedir and basefolder (project
    directory name and base path under
    project directory copied to program
    directory).

    pathx is the directory path being
    processed (str).
    """
    # basedir is project name (Fwaulu, for example).
    foldernames = pathx.split(fld.FRONTSLASH)
    basedir = matchbasedir(foldernames)
    # Get directory under project directory.
    # _msresources, for example.
    idx = foldernames.index(basedir)
    # Directory under program directory ./ for MineSight files.
    basefolder = fld.SAMEFOLDER + fld.FRONTSLASH.join(foldernames[idx + 1:])
    return basedir, basefolder


def dealwithtoplevel(firstdir):
    """
    Compress top level files in the
    MineSight project directory.
   
    firstdir is the three tuple returned
    from the os.walk() generator function.

    Returns two tuple of integer smallfile
    multifilecounter used for naming
    multiple file archives and splitfilestrackerfile,
    an open file object for tracking split
    files for later reconstruction.
    """
    # Top level files.
    dirx = firstdir[0] + fld.FRONTSLASH
    basedir, basefolder = getbasedir(dirx)
    # File to track split files for later glueing back together.
    splitfilestrackerfile = open(fld.SAMEFOLDER + basedir + fld.FRONTSLASH +
                                 fld.SPLITFILETRACKER, fld.W)
    firstdirfiles = segregatefiles(firstdir[0], firstdir[2])
    compresslargefiles(firstdirfiles[fld.LARGE], dirx, fld.EMPTY, basedir,
                       splitfilestrackerfile)
    compressmediumfiles(firstdirfiles[fld.MEDIUM], dirx, fld.EMPTY, basedir)
    # This is for keeping track of
    # archives with more than one file.
    multifilecounter = 1
    multifilecounter = compresssmallfiles(firstdirfiles[fld.SMALL], dirx,
                                          fld.EMPTY, multifilecounter, basedir)
    return multifilecounter, splitfilestrackerfile


def dealwithlowerleveldirectories(dirs, multifilecounter, splitfilestrackerfile):
    """
    Finishes out compression of lower level
    folders under top level MineSight project
    directory.

    dirs is a partially exhausted (one iteration)
    os.walk() generator.

    multifilecounter is an integer used for
    naming multiple file archives.

    splitfilestrackerfile is an open file
    object used for tracking file splits
    for later retrieval.

    Returns orphanedfolders, a list of lower level
    folders to be deleted at the end of the program
    run.
    """
    orphanedfolders = []
    for dirx in dirs:
        # XXX - hack - I hate dealing with Windows paths.
        dirn = dirx[0].replace(fld.BACKSLASH, fld.FRONTSLASH)
        diry = dirn + fld.FRONTSLASH
        basedir, basefolder = getbasedir(diry)
        # Create directory in program path.
        fauxdir = fld.PROGFOLDER[:-1] + basefolder[1:-1]
        os.mkdir(fauxdir)
        orphanedfolders.append(fauxdir)
        # Skip anything that doesn't have files.
        if not dirx[2]:
            continue
        # Easiest way to do this might be
        # to track directories and sort
        # files according to size, then
        # filter them accordingly.
        dirfiles = segregatefiles(dirx[0], dirx[2])
        compresslargefiles(dirfiles[fld.LARGE], diry, basefolder,
                           basedir, splitfilestrackerfile)
        compressmediumfiles(dirfiles[fld.MEDIUM], diry, basefolder, basedir)
        multifilecounter = compresssmallfiles(dirfiles[fld.SMALL], diry, basefolder,
                                              multifilecounter, basedir)
    splitfilestrackerfile.close()
    return orphanedfolders


def walkdir(dirx):
    """
    Traverse MineSight project directory,
    7zipping everything along the way.

    dirx is a string for the directory
    to traverse.

    Side effect function.
    """
    dirs = os.walk(dirx)
    # OK - os.walk returns generator that
    #      yields a tuple in the format
    #          (str path,
    #           [list of paths for directories under path],
    #           [list of filenames under path])

    # Top level (Fwaulu, for instance).
    # These files will not have a path
    # prefix of any sort in their respective
    # archives.
    firstdir = dirs.next()
    multifilecounter, splitfilestrackerfile = dealwithtoplevel(firstdir)
    # All other files and folders.
    orphanedfolders = dealwithlowerleveldirectories(dirs, multifilecounter,
                                                    splitfilestrackerfile)
    # Delete lower level folders first - this is necessary.
    orphanedfolders.reverse()
    for orphanx in orphanedfolders:
        print(fld.DELETINGDIR.format(orphanx))
        os.rmdir(orphanx)


def cyclefolders(folderx):
    """
    Wrapper function for compression
    of folder folderx (string).

    Side effect function.
    """
    # 1) Set up empty project directory (ex: Fwaulu)
    #    in program directory.
    # 2) For first set of files, use no prefix for
    #    7zip archive storage (filename only).
    # 3) Check for size of file.
    # 4) If file is bigger than fld.CHUNK, split.
    # 5) If file is smaller than fld.CHUNK, but bigger than
    #    MULTFILELIMIT, compress to one archive.
    # 6) If file is smaller than fld.CHUNK, and smaller than
    #    MULTFILELIMIT, check subsequent files to determine
    #    files to include in archive. Keep track of file
    #    index that puts number of bytes over limit.
    # 7) Compress multiple files to one archive - index
    #    archive to ensure unique name.
    # 8) For all following sets of files, same process,
    #    but must prefix paths with SAMEFOLDER and any
    #    additional folder names.
    foldertracker = []
    # Make directory folder in program directory
    # to hold 7zip files.
    zipfolder = getbasedir(folderx)[0]
    os.mkdir(zipfolder)
    foldertracker.append(zipfolder)
    walkdir(folderx)
    print('\nDone')
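The trickiest piece above is getsmallfilegroupings.  The greedy packing it does is easier to see on concrete numbers; a minimal sketch (plain Python 3, lists instead of generators):

```python
def group_by_limit(sized_files, limit):
    """Greedily pack (size, name) two tuples into groups whose
    sizes sum to at most limit.  Assumes no single size exceeds
    limit (true here, since small files are under half of it)."""
    groups = []
    current, total = [], 0
    for size, name in sized_files:
        if total + size > limit:
            # Close out the current group and start a new one.
            groups.append(current)
            current, total = [], 0
        current.append(name)
        total += size
    if current:
        groups.append(current)
    return groups

# Hypothetical sizes against the post's 8500 byte limit.
groups = group_by_limit([(3000, 'a'), (3000, 'b'), (3000, 'c'), (2000, 'd')],
                        8500)
```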


cyclefolders is the overarching wrapper function for the module (the compression operation).
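Everything above leans on getbasedir to locate the project name inside a path.  Its path surgery, sketched with the example BASEDIRS names from folderstozip.py (the input path is made up):

```python
BASEDIRS = ['Pit-1', 'Pit-2', 'Pit-3']

def basedir_and_basefolder(pathx):
    """Return (project directory name, './'-relative remainder) -
    the same split getbasedir performs."""
    parts = pathx.split('/')
    # First path component that matches a known project name.
    basedir = next((p for p in parts if p in BASEDIRS), None)
    idx = parts.index(basedir)
    return basedir, './' + '/'.join(parts[idx + 1:])

# Hypothetical project path.
basedir, basefolder = basedir_and_basefolder(
    'C:/work/Pit-2/_msresources/06SOLIDS/')
```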

#!java -jar C:\jython-2.7.0\jython.jar

# unsevenzipper.py

"""
Use java 3rd party 7-zip compression
library (sevenZipJBindings) from
jython to un-7zip archives.
"""


# Need to adjust path to get necessary jar imports.
# XXX - it might be cleaner to chain imports by using
#       the sevenzipper (s7 alias) below to reference
#       double imported modules.  For development and
#       convenience I reimported everything as though
#       sevenzipper.py and unsevenzipper.py were separate
#       operations.
import sys
import folderstozip as fld
sys.path.append(fld.PATH7ZJB)
sys.path.append(fld.PATH7ZJBOSSPEC)

import os
import sevenzipper as s7
import SevenZipThingExtract

def subdirectoryornot(pathx):
    """
    Boolean function that returns
    True if string pathx is a
    subdirectory of the MineSight
    project folder and False if
    the files belong directly to
    the MineSight project folder.
    """
    pathx = pathx.replace(fld.SAMEFOLDERWIN, fld.BACKSLASH)
    pathlist = pathx.split(fld.BACKSLASH)
    if len(pathlist) > 1:
        return True
    return False


def getdirectories(dirx):
    """
    Get list of lists of directories
    in path under project folder
    from 7zip archives in project
    folder for archives.

    Returns two tuple of list and
    dictionary indicating which
    7z files are same directory
    archives and which are archived
    subdirectory files.

    dirx is a string for the file
    path of the directory to
    be walked (./Fwaulu for example).
    """
    dirs = os.walk(dirx)
    # One level, no subfolders.
    files = dirs.next()[2]
    # Get directories first.
    rawpaths = []
    subdirornot = {}
    for filex in files:
        # Skip uncompressed split file tracker.
        if filex == fld.SPLITFILETRACKER:
            continue
        # I don't know if it's a subdirectory or not, so I'll go with False.
        s7tx = SevenZipThingExtract(dirx + fld.FRONTSLASH + filex, dirx, False)
        folders = list(s7tx.getfolders())
        rawpaths.extend(folders)
        # All the paths in folders have the same prefix -
        # just do one.
        subdirornot[filex] = subdirectoryornot(folders[0])
    # Get just directories
    justdirectories = [pathx.replace(fld.SAMEFOLDERWIN, fld.BACKSLASH).split(fld.BACKSLASH)[1:-1]
                       for pathx in rawpaths if pathx.split(fld.BACKSLASH)[1:-1]]
    justdirectories = set([tuple(x) for x in justdirectories])
    justdirectories = list(justdirectories)
    justdirectories.sort()
    return justdirectories, subdirornot


def makedirectories(dirn):
    """
    Create directory paths within archive
    project folder to accept uncompressed
    files.

    Returns subdirornot dictionary.
    dirn is a string for the file
    path of the directory to
    be walked (./Fwaulu for example).
    """
    justdirectories, subdirornot = getdirectories(dirn)
    maxdepth = max(len(dirx) for dirx in justdirectories)
    for x in xrange(0, maxdepth):
        justdirectoriesii = set([tuple(dirx[0:x + 1]) for dirx in justdirectories
                                 if len(dirx) >= x + 1])
        for diry in justdirectoriesii:
            dirw = dirn + fld.FRONTSLASH + fld.FRONTSLASH.join(diry)
            os.mkdir(dirw)
    return subdirornot

def extractfiles(dirx):
    """
    Extract files from 7z files
    in project archive folder.

    Side effect function.
    dirx is a string for the file
    path of the directory to
    be walked.
    """
    subdirornot = makedirectories(dirx)
    dirs = os.walk(dirx)
    # One level, no subfolders.
    files = dirs.next()[2]
    for filex in files:
        # Skip uncompressed split file tracker.
        if filex == fld.SPLITFILETRACKER:
            continue
        s7tx = SevenZipThingExtract(dirx + fld.FRONTSLASH + filex,
                                    dirx, subdirornot[filex])
        s7tx.extractfiles()


def gluetogethersplitfiles(dirx):
    """
    Make split up files whole.

    Side effect function.
    dirx is the folder in which the split
    files reside.
    """
    # Glue together big files.
    # Do this in a very controlling,
    # structured way:
    # 1) Read the split file tracker csv file.
    # 2) Determine the number and names and paths
    #    of files to be reconstructed and the
    #    number of parts in each.
    # 3) Check that everything is there for
    #    each file to be reconstructed.
    # 4) Get the new relative path.
    # 5) Glue back together programmatically.
    splitfiles = []
    # fld.SPLITFILETRACKER is structured as original path
    # of file split, number of file split.
    with open(fld.SAMEFOLDERWIN + dirx +
              fld.FRONTSLASH + fld.SPLITFILETRACKER, 'r') as f:
        for linex in f:
            strippedline = [x.strip() for x in linex.split(fld.UCOMMA)]
            splitfiles.append(tuple(strippedline))
    orignames = [x[0] for x in splitfiles]
    splitoriginals = set(orignames)
    # Make dictionary that is easy to cycle through.
    filesx = {}
    for orig in splitoriginals:
        basedir, basefolder = s7.getbasedir(orig)
        filesx[orig] = {}
        filesx[orig][fld.BASEPATH] = fld.SAMEFOLDER + basedir + basefolder[1:]
        filesx[orig][fld.FILES] = (fld.SPLITFILE.format(filesx[orig][fld.BASEPATH], filex[1])
                                   for filex in splitfiles if filex[0] == orig)
    for orig in filesx:
        with open(filesx[orig][fld.BASEPATH], fld.WB) as mainfile:
            for filex in filesx[orig][fld.FILES]:
                with open(filex, fld.RB) as splitfile:
                    mainfile.write(splitfile.read())


def restore(dirx):
    """
    Restores MineSight project directory
    inside program path.

    dirx is a string for the directory
    to be restored (./Fwaulu, for example).

    Side effect function.
    """
    extractfiles(dirx)
    gluetogethersplitfiles(dirx)
    print('Done')


restore is the main function for the module (uncompression).
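The glue step in gluetogethersplitfiles boils down to ordered binary concatenation.  A stripped-down sketch of just that step (the names recombine, partpaths, and outpath are mine, not the module's fld constants):

```python
def recombine(partpaths, outpath):
    """Concatenate split file parts, in the order given,
    back into one whole file."""
    with open(outpath, 'wb') as mainfile:
        for partpath in partpaths:
            with open(partpath, 'rb') as splitfile:
                mainfile.write(splitfile.read())
```

Reading each part fully into memory mirrors what the module does; for very large parts, copying in fixed-size chunks (shutil.copyfileobj, for instance) would be gentler on memory.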

Notes:

1) I don't have admin rights at work and did not have javac (the Java compiler) available.  You can download a JDK package from Oracle that includes it (the JRE alone does not).  Without admin rights, you can't install it normally, but you can still use it.  My compilation went something like this:

<path to downloaded JDK>/bin/javac -cp <path to downloaded 7-ZipJBinding>/lib/* <myclassname>.java

2) I've left all the split up files and 7z archives in the folder where I decompress my files and recombine the split files.  This takes up a lot of space depending on what you're working with.  If space is at a premium, you probably want to write jython code to move or delete the archives after uncompressing them.
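A minimal sketch of that cleanup, assuming the archives end in .7z and the split parts carry numeric suffixes - both hypothetical names here, since the real suffixes would belong with the other constants in the fld module:

```python
import os

# Hypothetical suffixes - the real names would live in the fld module.
ARCHIVESUFFIXES = ('.7z', '.001', '.002', '.003')

def removearchives(dirx):
    """Walk dirx and delete the 7z archives and split file
    parts left behind after extraction and recombination."""
    for root, dirs, files in os.walk(dirx):
        for filex in files:
            # str.endswith accepts a tuple of suffixes.
            if filex.endswith(ARCHIVESUFFIXES):
                os.remove(os.path.join(root, filex))
```

You would only want to run something like this after verifying the recombined files, since it deletes the originals.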

3) The most time consuming parts of the runtime are compression, uncompression, and the splitting and recombining of split files.  Porting some of this to java (instead of jython) might speed things up, but I code faster and generally better in jython.  Also, my objective was control, not speed.  YMMV (your mileage may vary) with this approach; there are far better general purpose ones.

Thanks for stopping by.