Lately I had another Excel-VBA-Python one-off hack project. Once again there was the dilemma of not being able to use MSSQL's bcp because my query string was too long. sqlcmd can run a query from a big SQL file, but, to the best of my knowledge, it does not do csv dumps.
This is a hack. I would normally go to hell for it, but I've done so many other bad hacks I'd have to declare bankruptcy on my programming soul and start over. Onward.
mssql query file:
<SQL code>
< . . . variable declarations, temp table declarations, etc. . . . >
DECLARE @COMMA CHAR(1) = ',';
DECLARE @LOSSLESS INT = 3;
DECLARE @DOUBLEQUOTE CHAR(1) = CHAR(34);
-- Concatenate strings.
-- Need quoted strings for stockpiles with spaces.
SELECT @DOUBLEQUOTE + StockpileShortName +
@DOUBLEQUOTE + @COMMA +
@DOUBLEQUOTE + StockpileID +
@DOUBLEQUOTE + @COMMA +
@DOUBLEQUOTE + StkLoc +
@DOUBLEQUOTE + @COMMA +
-- Go for full float precision.
CONVERT(VARCHAR(35),
tonnes,
@LOSSLESS) + @COMMA +
CONVERT(VARCHAR(35),
grade01,
@LOSSLESS) + @COMMA +
CONVERT(VARCHAR(35),
grade02,
@LOSSLESS) + @COMMA +
CONVERT(VARCHAR(35),
grade03,
@LOSSLESS) + @COMMA +
CONVERT(VARCHAR(35),
grade04,
@LOSSLESS) + @COMMA +
CONVERT(VARCHAR(35),
grade05,
@LOSSLESS) + @COMMA +
CONVERT(VARCHAR(35),
grade06,
@LOSSLESS)
FROM ##inputresultspvctrachte
< . . . ORDER BY clause . . .>
<End SQL code>
It's pretty obvious what I'm doing (and I'd be shocked if I'm the first to do it): concatenate all the fields of each result record into a single comma-separated line.
A couple notes:
1) all my string identifiers are in double quotes; all my float values are in unquoted text - this will help simplify the Python csv module code below.
2) the @LOSSLESS "constant" - Microsoft's SQL documentation doesn't list an enumeration for this per se. It's just a straight up whole number 3. I'm a bit obsessive about constants - wrap that baby in a variable declaration! Lossless double precision means, if I recall correctly, SQL Server will give you seventeen digits of precision. This works for what I'm doing (mining stockpile management).
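To make the seventeen digits concrete, here is a minimal Python sketch (not part of the original project; the helper name roundtrips is made up for illustration). It checks that a double written out with 17 significant digits parses back to exactly the same value, which is the property a lossless text dump needs, and that fewer digits can lose information.

```python
import random


def roundtrips(value, digits):
    """
    Return True if value survives a trip through text
    when written with the given number of significant digits.
    """
    text = format(value, '.{0}g'.format(digits))
    return float(text) == value


# Made-up sample values in a tonnage-like range.
rng = random.Random(0)
samples = [rng.uniform(0.0, 1.0e6) for _ in range(1000)]

# 17 significant digits is enough for any IEEE 754 double.
all17 = all(roundtrips(v, 17) for v in samples)

# 15 digits is not always enough: 0.1 + 0.2 collapses to 0.3.
lossy15 = not roundtrips(0.1 + 0.2, 15)

print(all17, lossy15)
```

The same reasoning is why the query asks for the lossless style rather than the default float-to-string conversion.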
The (rough) mssql command to run the query from a DOS prompt:
sqlcmd -S MYSERVERNAME -U MYUSERNAME -P MYPASSWORD -i myqueryfile.sql -o theoutputfile.csv -b
The -b switch makes sqlcmd exit and return a Windows error code (ERRORLEVEL) when the batch hits an error. It's a crude check for whether the query parsed OK and ran, but it's better than nothing.
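If the command is launched from Python rather than typed at the prompt, the error code can be checked there. The following is a rough sketch, assuming sqlcmd is on the PATH; the function names buildsqlcmdargs and runsqlcmd are hypothetical and the server, user, and file names are placeholders.

```python
import subprocess


def buildsqlcmdargs(server, user, password, queryfile, outputfile):
    """
    Return the sqlcmd command line as a list of arguments.
    -i is the input query file, -o the output file, and -b
    makes sqlcmd return an error code if the batch fails.
    """
    return ['sqlcmd',
            '-S', server,
            '-U', user,
            '-P', password,
            '-i', queryfile,
            '-o', outputfile,
            '-b']


def runsqlcmd(args):
    """
    Run sqlcmd and return True if it exited with code 0.
    This is the same crude check as looking at ERRORLEVEL.
    """
    return subprocess.call(args) == 0


# Example usage (commented out so nothing runs by accident):
# args = buildsqlcmdargs('MYSERVERNAME', 'MYUSERNAME', 'MYPASSWORD',
#                        'myqueryfile.sql', 'theoutputfile.csv')
# if not runsqlcmd(args):
#     print('sqlcmd reported an error')
```

Passing the arguments as a list avoids having to quote paths with spaces by hand.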
The output looks something like this:
<. . . sqlcmd messages . . .>
"KEY003","hakunamatadacopper","good",28776.5,X.XXXXX,X.XXXXX,X.XXXXX,X.XXXXXX,XX.XXXX,X.XXXXX
"KEY005","tembomalachite","not as good",25855.9,X.XXXXX,X.XXXXXX,X.XXXXX,X.XXXXXX,XX.XXXX,X.XXXXX
"KEY006","simbacobalt","not as good",156767,X.XXXXXX,X.XXXXXXX,X.XXXXXX,X.XXXXXXX,XX.XXXX,X.XXXXXX
"KEY010","jambocobalt","good",488977,X.XXXXX,X.XXXXXX,X.XXXX,X.XXXXXX,XXX.XXX,X.XXXXX
"KEY015","cucoagogo","good",39576.7,X.XXXX,X.XXXXXX,X.XXXXX,X.XXXXXX,XX.XXXX,X.XXXXX
"KEY016","greenrock","good",160,X.XXX,X.XXX,X.XXX,X.XXX,XXX.XX,X.XX
"KEY033","pinkrock","not as good",81504.3,X.XXXXX,X.XXXXXX,X.XXXXX,X.XXXXXX,XXX.XXX,X.XXXX
"KEY006","funkyleach","not as good",55866.1,X.XXXXXX,X.XXXXXX,X.XXXXXX,X.XXXXXX,XXX.XXX,X.XXXXXX
"KEY010","metalhome","good",30301.1,X.XXXXX,X.XXXXXX,X.XXXXX,X.XXXXXX,XXX.XX,X.XXXXX
"KEY015","boulderpile","good",2878.25,X.XX,X.XX,X.XXX,X.XXX,XX.XXX,X.XXX
"KEY033","berm","not as good",5309.97,X.XXXXX,X.XXXXXX,X.XXXXX,X.XXXXXX,XXX.XXX,X.XXXXX
(11 rows affected)
I've given my stockpiles funny names and X'ed out the numeric grades to sanitize this, but you get the general idea.
Now, finally, to some Python code. I'll get the lines of the file (faux csv) I want and parse them with the csv module reader object. The whole deal is kind of verbose (I have a collections.namedtuple object that takes each "column" as an attribute). I'm only going to show the part that segregates the lines I want and reads them with the csv reader. The wpx module has all of my constants and static data definitions in it. I still need to work out some of the whitespace issues; for now I brute-force strip leading and trailing whitespace from values.
import csv

import wpx  # my module of constants and static data definitions


def parsesqlcmdoutput():
    """
    Parse output from sqlcmd.
    Returns list of
    collections.namedtuple
    objects.
    """
    lines = []
    with open(wpx.OUTPUTFILE +
              wpx.CSVEXT, 'r') as f:
        # Get relevant lines.
        # Rip whitespace off end - excessive.
        # XXX - string overloading - hack.
        lines = [linex.strip() for
                 linex in f if
                 linex[0:wpx.STKFLAG[0]] ==
                 wpx.STKFLAG[1]]
    rdr = csv.reader(lines,
                     quoting=csv.QUOTE_NONNUMERIC)
    records = []
    for r in rdr:
        # Get rid of whitespace padding
        # around string values.
        # (xrange is Python 2; use range on Python 3.)
        for x in xrange(wpx.IHSTRIDX):
            r[x] = r[x].strip()
        records.append(wpx.INPUTRECORD(*r))
    return records
That csv.QUOTE_NONNUMERIC (quote non-numeric) setting is handy. As per the Python doc, anything that isn't quoted is converted to a float by the reader. As long as my data are clean, I should be good there, and it strips out some cruft code-wise.
The list comprehension produces a list, which is an iterable of lines the same way a file is, so the csv module's reader works fine on it.
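Since the wpx module isn't shown, here is a self-contained sketch of the same idea that can be run as-is. Everything in it is a stand-in: the RAWLINES data and grade values are invented, and the rule "data lines start with a double quote" is an assumption replacing the STKFLAG test above.

```python
import csv

# Invented stand-in for the lines of the sqlcmd output file.
RAWLINES = [
    '<. . . sqlcmd messages . . .>\n',
    '"KEY003","hakunamatadacopper","good   ",28776.5,1.25\n',
    '"KEY005","tembomalachite","not as good",25855.9,0.75\n',
    '(2 rows affected)\n',
]


def parselines(rawlines):
    """
    Keep only the data lines and parse them with csv.reader.
    Quoted fields come back as strings, unquoted fields as floats.
    """
    # Assumption: every data line starts with a double quote.
    wanted = [linex.strip() for linex in rawlines
              if linex.startswith('"')]
    # A plain list of strings is a fine input for csv.reader.
    rdr = csv.reader(wanted, quoting=csv.QUOTE_NONNUMERIC)
    records = []
    for row in rdr:
        # Strip padding from string values; leave floats alone.
        records.append([v.strip() if isinstance(v, str) else v
                        for v in row])
    return records


records = parselines(RAWLINES)
for record in records:
    print(record)
```

Note that a stray non-numeric value left unquoted would make the reader raise ValueError, which is the "as long as my data are clean" caveat in practice.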
That's about it (minus a lot of background code - if you need that, let me know and I'll put it in the comments).
Thanks for stopping by.
Thursday, August 11, 2016
Sunday, July 10, 2016
Using Generators and Coroutines to Merge Tabular Data (Drill Holes)
I have some mining drill hole data that I need to merge into an old vendor FORTRAN input format. Basically I do a series of SQL pulls from the drillhole database to csv files, then merge the data. My methodology has been a bit brute force in matching the separate parts of the drill hole data (lists, opening and closing of files to find matching holes, etc.). My thought was that I could do this more elegantly and efficiently by iterating through the files with generators.
The ability of generators to communicate with each other via the send() method intrigued me. I had always been a bit shy about using this language feature. My csv problem gave me a justification for checking it out.
The reference I used was Dr. Dave Beazley's 2009 PyCon tutorial. He does a nice job of explaining things as well as dispensing good advice. (I disobeyed the good advice in the interest of shoehorning coroutines into my solution; I'll cover this below.) Beazley defines a coroutine, in the sense of generators and the "yield" keyword, as a generator where "yield" is used more generally. That is the sense in which I'm using the word "coroutine" in this post.
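For anyone who hasn't used send() before, a minimal sketch of the mechanics may help (the runningtotal function is invented for illustration and is not from Beazley's tutorial). A generator has to be advanced to its first "yield" before it can receive a value; after that, each send() both delivers a value into the generator and returns whatever it yields next.

```python
def runningtotal():
    """
    Generator that receives numbers through send()
    and yields the running total after each one.
    """
    total = 0
    while True:
        # The value passed to send() becomes the
        # result of this yield expression.
        value = yield total
        total += value


gen = runningtotal()
# Prime the generator: run it up to the first yield.
first = next(gen)
afterfive = gen.send(5)
afterten = gen.send(10)
print(first, afterfive, afterten)
```

Calling next(gen) here does the same job as the send(None) priming call used in the code later in this post.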
Given my problem of a one (drill hole start survey) to many (drill hole interval values) relationship, I attempted a very simple (perhaps oversimplified) toy program demo of what I wanted to do with real data:
def coroutinex(subgenerator):
    """
    Generator function that consumes
    a key value sent from a higher
    level generator. This generator
    yields two-tuples of the form
    (<boolean>, data). The boolean
    value indicates whether the key
    matches the data.
    Returns a generator.
    """
    while True:
        # One entry point for send()/reset.
        keyx = yield
        # Finish cleanly when the data run out. Letting
        # next() raise StopIteration inside a generator
        # is a RuntimeError as of Python 3.7 (PEP 479).
        subdatatop = next(subgenerator, None)
        if subdatatop is None:
            return
        if subdatatop[0] == keyx:
            yield (True, subdatatop)
            for subdataloop in subgenerator:
                if subdataloop[0] == keyx:
                    yield (True, subdataloop)
                else:
                    yield (False, subdataloop)
                    break


def toplevelgen(topleveliter, coroutinex):
    """
    Top level generator function.
    coroutinex is the generator/coroutine
    that this generator sends
    a key value to. The
    coroutine yields a two-tuple
    that communicates if
    the key matches or not.
    Returns a generator.
    """
    # Get sub generator/coroutine initialized.
    coroutinex.send(None)
    # Variable for dealing with return
    # from sub-generator/coroutine.
    subvalue = False
    for keyx in topleveliter:
        yield keyx
        if subvalue:
            yield subvalue
        subvalue = coroutinex.send(keyx)
        # Get sub generator/coroutine re-initialized
        # after send() reset.
        if subvalue is None:
            # XXX - hack
            subvalue = coroutinex.send(keyx)
        yield subvalue
        for submessage in coroutinex:
            # XXX - another hack to deal with yield of None.
            if not submessage:
                continue
            subvalue = submessage
            # if submessage[0] is True, kick it out.
            if submessage[0]:
                yield submessage
            else:
                # Keep subvalue for after keyvalue
                # yield at top.
                break


topleveliter = range(44, 55)
keysx = [44, 44, 44, 45, 45, 45, 45, 45,
         46, 46, 46, 46, 46, 46, 46, 46,
         47, 47, 47, 48, 48, 48, 48, 48,
         49, 49, 49, 49, 49, 49, 50, 50,
         51, 51, 51, 51, 51, 51, 51, 51,
         52, 52, 52, 52, 52, 52, 52, 52,
         53, 53, 53, 53, 53, 53, 53, 53,
         54, 54, 54, 54, 54, 54, 54, 54]
sequencex = range(1, len(keysx) + 1)
subgenerator = zip(keysx, sequencex)
gensub = coroutinex(subgenerator)
genmain = toplevelgen(topleveliter, gensub)
for x in genmain:
    print(x)
Output:
44
(True, (44, 1))
(True, (44, 2))
(True, (44, 3))
45
(False, (45, 4))
(True, (45, 5))
(True, (45, 6))
(True, (45, 7))
(True, (45, 8))
46
(False, (46, 9))
(True, (46, 10))
(True, (46, 11))
(True, (46, 12))
(True, (46, 13))
(True, (46, 14))
(True, (46, 15))
(True, (46, 16))
47
(False, (47, 17))
(True, (47, 18))
(True, (47, 19))
48
(False, (48, 20))
(True, (48, 21))
(True, (48, 22))
(True, (48, 23))
(True, (48, 24))
49
(False, (49, 25))
(True, (49, 26))
(True, (49, 27))
(True, (49, 28))
(True, (49, 29))
(True, (49, 30))
50
(False, (50, 31))
(True, (50, 32))
51
(False, (51, 33))
(True, (51, 34))
(True, (51, 35))
(True, (51, 36))
(True, (51, 37))
(True, (51, 38))
(True, (51, 39))
(True, (51, 40))
52
(False, (52, 41))
(True, (52, 42))
(True, (52, 43))
(True, (52, 44))
(True, (52, 45))
(True, (52, 46))
(True, (52, 47))
(True, (52, 48))
53
(False, (53, 49))
(True, (53, 50))
(True, (53, 51))
(True, (53, 52))
(True, (53, 53))
(True, (53, 54))
(True, (53, 55))
(True, (53, 56))
54
(False, (54, 57))
(True, (54, 58))
(True, (54, 59))
(True, (54, 60))
(True, (54, 61))
(True, (54, 62))
(True, (54, 63))
(True, (54, 64))
Back to Dr. Beazley's advice: he doesn't recommend this. Even though "yield" is the keyword in both cases, it means two different things in two different contexts. Do not mix generator and coroutine functionality. I'm going ahead in this post and doing it anyway. I don't have an excuse. It does remind me of some old Bob Dylan lyrics:
Now the rainman gave me two cures
Then he said, "Jump right in"
The one was Texas medicine
The other was just railroad gin
An' like a fool I mixed them
An' it strangled up my mind
It's OK, Bob, some of us just need to learn things the hard way.
Onward.
A brief diversion on drill holes - the data for a small scale (about 2,000 feet or less) geotechnical or geologic drill hole come back in three parts:
1) collar - where the hole starts in space (coordinates).
2) surveys - where the hole ends up going in space relative to the collar (drill pipe has proven to be amazingly flexible when passing through rock).
3) assays - usually the hole is sampled along intervals and chemically or physically analyzed. The assay intervals may or may not coincide with survey intervals.
Clear as (drilling) mud? Great - back to Python.
The problem:
Three tabular csv dumps from SQL - a collar file, a survey file, and an assay file. Each has a unique key in the first column that matches across files (the drill hole key). On the SQL side I have ensured that there are no orphan key rows in any of the three files and that all three are sorted on the key.
I present the sanitized output here first - it will give some context to the domain specific parts of the code:
XXXXX,XXXXXX.XXXX,XXXXXXX.XXXX,XXXX.XXXX,0.0000,0.0000,26.4529
XXXXX,0.0000,1.1925,1.1925,283.5688,-13.5310
XXXXX,1.1925,4.2760,3.0836,284.6224,1.9328 SURVEYS
XXXXX,4.2760,6.3799,2.1039,280.2829,-3.1334 GO
XXXXX,6.3799,9.7024,3.3225,282.5794,2.3632 HERE
XXXXX,9.7024,11.8701,2.1677,285.4406,-1.1631 AFTER
XXXXX,11.8701,13.6920,1.8219,275.9462,-5.0698 COLLAR
XXXXX,13.6920,17.1199,3.4279,285.4561,1.9560 LOCATION
XXXXX,17.1199,19.6944,2.5746,279.2318,-0.7344
XXXXX,19.6944,22.5857,2.8913,282.1947,4.3241
XXXXX,22.5857,24.1879,1.6022,283.8367,-1.7525
XXXXX,24.1879,26.4529,2.2650,287.3820,13.4805
XXXXX <----- LEGACY DRILLHOLE NUMBER
XXXXX,X.XXXX,X.XXXX,X.XXXX,X.XX,X.XX,X.XX, etc.
XXXXX,X.XXXX,X.XXXX,X.XXXX,X.XX,X.XX,X.XX, etc.
XXXXX,X.XXXX,X.XXXX,X.XXXX,X.XX,X.XX,X.XX, etc. ASSAYS
XXXXX,X.XXXX,X.XXXX,X.XXXX,X.XX,X.XX,X.XX, etc. GO
XXXXX,X.XXXX,XX.XXXX,X.XXXX,X.XX,X.XX,X.XX, etc. HERE
XXXXX,XX.XXXX,XX.XXXX,X.XXXX,X.XX,X.XX,X.XX,XX.XX, etc.
XXXXX,XX.XXXX,XX.XXXX,X.XXXX,X.XX,X.XX,X.XX,XX.XX, etc.
XXXXX,XX.XXXX,XX.XXXX,X.XXXX,X.XX,X.XX,X.XX,XX.XX, etc.
XXXXX,XX.XXXX,XX.XXXX,X.XXXX,X.XX,X.XX,X.XX,XX.XX, etc.
XXXXX,XX.XXXX,XX.XXXX,X.XXXX,X.XX,X.XX,X.XX,XX.XX, etc.
XXXXX,XX.XXXX,XX.XXXX,X.XXXX,X.XX,X.XX,X.XX,XX.XX, etc.
<----- BLANK LINE
XXXXXX,XXXXXX.XXXX,XXXXXXX.XXXX,XXXX.XXXX,0.0000,0.0000,23.5411
XXXXXX,0.0000,2.5781,2.5781,135.0157,2.3341
XXXXXX,2.5781,5.0351,2.4570,137.1873,5.5353
XXXXXX,5.0351,7.3706,2.3354,135.2276,7.7020
XXXXXX,7.3706,9.9168,2.5462,136.4253,6.4493
.
.
.
.
.
.
.
etc.
And the code (sorry about the size - it got messier than I would have hoped):
#!C:\Python35\python
"""
Parse collar, survey, and assay dumps for
trenches from vendor drill hole RDBMS.
Write specially formatted data file for
consumption by old vendor FORTRAN
routine 201.
"""
import csv
from collections import namedtuple
from collections import OrderedDict
COLLAR = './data/collar.csv'
SURVEY = './data/survey.csv'
ASSAYS = './data/assays.csv'
DAT201 = './data/TR.dat'
# collar (ssit) fields
ID = 'drillholeid'
NAME = 'drillholename'
DATE = 'drillholedate'
LEGACY = 'drillholehistoricname'
X = 'collarx'
Y = 'collary'
Z = 'collarz'
AZ = 'azimuth'
DIP = 'dip'
LEN = 'drillholelength'
COLLARFIELDS = [ID, NAME, DATE, LEGACY, X, Y, Z,
AZ, DIP, LEN]
# survey fields
FROM = 'fromx'
TO = 'depthto'
SAMPLEN = 'surveylength'
AZ = 'azimuth'
DIP = 'dip'
SURVEYFIELDS = [ID, NAME, DATE, LEGACY, FROM, TO,
SAMPLEN, AZ, DIP]
# assay fields
AFROM = 'assayfrom'
ATO = 'assayto'
AI = 'assayinterval'
ASSAY1 = 'assay1'
ASSAY2 = 'assay2'
ASSAY3 = 'assay3'
ASSAY4 = 'assay4'
ASSAY5 = 'assay5'
ASSAY6 = 'assay6'
ASSAY7 = 'assay7'
ASSAY8 = 'assay8'
ASSAYFIELDS = [ID, NAME, LEGACY, AFROM, ATO, AI, ASSAY1,
ASSAY2, ASSAY3, ASSAY4, ASSAY5, ASSAY6, ASSAY7, ASSAY8]
ASSAYFORMAT = '.2f'
SURVEYFORMAT = '.4f'
COMMA = ','
# Output for 201 file format.
# Collars.
COLOUTPUTCOLS = [X, Y, Z, AZ, DIP, LEN]
COLFMTOUTPUT = [(attribx, SURVEYFORMAT) for attribx in COLOUTPUTCOLS]
# Surveys.
SURVOUTPUTCOLS = [FROM, TO, SAMPLEN, AZ, DIP]
SURVFMTOUTPUT = [(attribx, SURVEYFORMAT) for attribx in SURVOUTPUTCOLS]
# Assays.
ASSYOUTPUTCOLS = [AFROM, ATO, AI, ASSAY1, ASSAY2, ASSAY3, ASSAY4, ASSAY5,
ASSAY6, ASSAY7, ASSAY8]
ASSYOUTPUTFMTS = 3 * [SURVEYFORMAT] + 8 * [ASSAYFORMAT]
# Have to use this repeatedly - hence list.
ASSYFMTOUTPUT = list(zip(ASSYOUTPUTCOLS, ASSYOUTPUTFMTS))
RETCHAR = '\n'
# For tracking which dataset we're
# dealing with.
SURVEYSUBDATA = 'survey'
ASSAYSUBDATA = 'assay'
# For survey/assay dictionary.
COR = 'coroutine'
FMT = 'format'
LAST = 'lastvalue'
END = 'end'
INFOMESSAGE = 'Now doing hole number {0} . . .'

def makecsvdatagenerator(csvrdr, ntname, ntfields):
    """
    Returns a generator that yields csv
    row records as named tuple objects.
    csvrdr is the csv.reader object.
    ntname is the name given to the
    collections.namedtuple object.
    ntfields is the list of field names
    for the collections.namedtuple object.
    """
    namedtup = namedtuple(ntname, ntfields)
    return (namedtup(*linex) for linex in csvrdr)


def formatassay(numstring, formatx):
    """
    Returns a string representing a float
    that typically is in 0.00 format, but
    other float formats can be applied.
    numstring is a string representing a float.
    formatx is the desired format (Python 3 format string).
    """
    return format(float(numstring), formatx)


def getnumericstrings(record, formats):
    """
    Returns list of strings.
    record is a collections.namedtuple instance.
    formats is a list of two-lists of namedtuple
    attributes and numeric string formats to be
    applied to each attribute's value.
    """
    return [formatassay(getattr(record, pairx[0]),
                        pairx[1])
            for pairx in formats]


def coroutinex(subgenerator):
    """
    Generator function.
    Consumes key value and yields
    two tuple of (<boolean>,
    next(subgenerator)) in response.
    boolean value indicates
    whether key matches first
    value of subgenerator namedtuple.
    subgenerator is a generator of
    namedtuples.
    Returns a generator.
    """
    while True:
        keyx = yield
        subdatatop = next(subgenerator)
        if subdatatop.drillholeid == keyx:
            yield (True, subdatatop)
            for subdataloop in subgenerator:
                if subdataloop.drillholeid == keyx:
                    yield (True, subdataloop)
                else:
                    yield (False, subdataloop)
                    break
        # Case where only one interval in
        # drill hole.
        else:
            yield (False, subdatatop)


def formatdataline(record, formats):
    """
    Prepare record as a line
    of text for write to file.
    record is a collections.namedtuple
    object.
    formats is a list of two tuples of
    namedtuple attributes and numeric
    string formats.
    Returns string.
    """
    recordline = [record.drillholehistoricname]
    recordline.extend(getnumericstrings(record,
                                        formats))
    return COMMA.join(recordline) + RETCHAR


def dealwithsend(subgen, sendval):
    """
    Helper function to clean up code.
    Deals with initial receipt of
    None value upon send() and
    re-sends value.
    Sends value sendval to
    generator/coroutine subgen.
    Returns two tuple of (<boolean>,
    <collections.namedtuple>).
    """
    retval = subgen.send(sendval)
    if retval is None:
        retval = subgen.send(sendval)
    return retval


def dealwithyieldrecord(survassay, subdata):
    """
    Helper function to clean up code.
    Formats values for write to file.
    survassay is a dictionary of values.
    subdata is the dictionary key that
    tells which data is being handled
    (survey or assay).
    """
    return formatdataline(survassay[subdata][LAST][1],
                          survassay[subdata][FMT])


def cyclecollars(collargen,
                 survassay):
    """
    Generator function that yields
    data (strings) for write to
    a specially formatted drill hole
    file.
    This is the top level generator
    for working the merging of
    drillhole data (collars, surveys,
    assays).
    survassay is a collections.OrderedDict
    object that references the respective
    survey and assay generators and holds
    information for tracking which subset
    of data (surveys or assays) are being
    worked.
    """
    for record in collargen:
        keyx = record.drillholeid
        label = record.drillholehistoricname
        survassay[SURVEYSUBDATA][END] = label + RETCHAR
        print(INFOMESSAGE.format(label))
        yield formatdataline(record, COLFMTOUTPUT)
        for subdata in survassay:
            fmt = survassay[subdata][FMT]
            if survassay[subdata][LAST]:
                yield dealwithyieldrecord(survassay, subdata)
            subvalue = dealwithsend(survassay[subdata][COR], keyx)
            # Case where only one interval.
            if not subvalue[0]:
                survassay[subdata][LAST] = subvalue
                yield survassay[subdata][END]
                continue
            yield formatdataline(subvalue[1], fmt)
            for submessage in survassay[subdata][COR]:
                # End of iteration.
                if submessage is None:
                    yield survassay[subdata][END]
                    break
                if submessage[0]:
                    yield formatdataline(submessage[1], fmt)
                else:
                    survassay[subdata][LAST] = submessage
                    yield survassay[subdata][END]
                    break


def main():
    """
    Parse csv dumps from SQL and write
    drillhole data fields for import
    to old vendor FORTRAN based binary
    files.
    Side effect function.
    """
    with open(COLLAR, 'r') as colx:
        colcsv = csv.reader(colx)
        collargen = makecsvdatagenerator(colcsv,
                                         'collars',
                                         COLLARFIELDS)
        with open(SURVEY, 'r') as svgx:
            survcsv = csv.reader(svgx)
            survgen = makecsvdatagenerator(survcsv,
                                           'surveys',
                                           SURVEYFIELDS)
            surveycoroutinex = coroutinex(survgen)
            with open(ASSAYS, 'r') as assx:
                assycsv = csv.reader(assx)
                assygen = makecsvdatagenerator(assycsv,
                                               'assays',
                                               ASSAYFIELDS)
                assaycoroutinex = coroutinex(assygen)
                with open(DAT201, 'w') as d201:
                    # Get sub generators/coroutines initialized.
                    surveycoroutinex.send(None)
                    assaycoroutinex.send(None)
                    surveyassay = OrderedDict()
                    surveyassay[SURVEYSUBDATA] = {COR: surveycoroutinex,
                                                  FMT: SURVFMTOUTPUT,
                                                  LAST: None,
                                                  END: None}
                    surveyassay[ASSAYSUBDATA] = {COR: assaycoroutinex,
                                                 FMT: ASSYFMTOUTPUT,
                                                 LAST: None,
                                                 END: RETCHAR}
                    colgenx = cyclecollars(collargen,
                                           surveyassay)
                    for linex in colgenx:
                        d201.write(linex)
    print('Done')

if __name__ == '__main__':
    main()
The bad news: this was more difficult with a real world dataset than I anticipated. Beazley's admonition was an apt one.
The good news: it does perform better than my previous brute force implementations. From the standpoint of iterating through datasets and not wasting resources (even with the polling or interrupting or whatever facilitates the generator communication closer to the metal), this is a better implementation. Also, I learned a bit more about the "yield" keyword.
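For anyone who skipped straight to the end, here is a stripped-down sketch of the send/yield handshake the code above relies on. The name keyed_records and the toy survey data are made up for illustration; the real code wraps csv readers and namedtuples, and this is only the general shape of it.

```python
def keyed_records(records):
    """
    Coroutine that receives a hole id through send() and
    yields two-tuples of (matches_current_hole, record).
    """
    # The priming send(None) pauses here; the next send()
    # delivers the first hole id.
    key = yield
    for record in records:
        # record[0] is the hole id in this toy data.
        sent = yield (record[0] == key, record)
        # A plain next() sends None: keep the current key.
        if sent is not None:
            key = sent


surveys = [('DH1', 0.0), ('DH1', 50.0), ('DH2', 0.0)]
cor = keyed_records(surveys)
cor.send(None)           # Prime the coroutine.
first = cor.send('DH1')  # First record for DH1.
second = next(cor)       # Still DH1.
third = next(cor)        # Belongs to DH2: flagged False.
print(first, second, third)
```

The record flagged False has already been pulled off the underlying data by the time the consumer sees it. That is why cyclecollars stashes it under LAST and writes it out when the next collar comes around.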
Thanks for stopping by.
Monday, April 18, 2016
7-Zip-JBinding API with jython on Windows
I have a set of multi-GB Windows folders that I need to archive in 7-zip format each month. I'd prefer not to use the mouse to compress the folders "manually." Also, I didn't want to use the command line with the subprocess module like I have with some other programs. Ideally, I wanted to control 7zip programmatically. The 7-Zip-JBinding libraries offered a means to do this from jython.
7-Zip-JBinding is built around Java interfaces (callbacks) that are structured pretty specifically. I did not venture too far away from the examples given in the 7-Zip-JBinding documentation. I smithed two modules for my own purposes, compressing and uncompressing, and present them (Java code) below. The decompression one has a separate method for retrieving the paths of the compressed files. This is not efficient, but for what I need to do, and given the limitations of the library and the approach, it works out for the best.
import java.io.IOException;
import java.io.RandomAccessFile;
import net.sf.sevenzipjbinding.IOutCreateArchive7z;
import net.sf.sevenzipjbinding.IOutCreateCallback;
import net.sf.sevenzipjbinding.IOutItem7z;
import net.sf.sevenzipjbinding.ISequentialInStream;
import net.sf.sevenzipjbinding.SevenZip;
import net.sf.sevenzipjbinding.SevenZipException;
import net.sf.sevenzipjbinding.impl.OutItemFactory;
import net.sf.sevenzipjbinding.impl.RandomAccessFileOutStream;
import net.sf.sevenzipjbinding.util.ByteArrayStream;
/* Off StackOverflow - works for getting
* file content/bytes from path */
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.Path;
public class SevenZipThing {
private static final String RETCHAR = "\n";
private static final String INTFMT = "%,d";
private static final String BYTESTOCOMPRESS = " bytes total to compress\n";
private static final String ERROCCURS = "Error occurs: ";
private static final String COMPRESSFILE = "\nCompressing file ";
private static final String RW = "rw";
private static final int LVL = 5;
private static final String SEVZERR = "7z-Error occurs:";
private static final String ERRCLOSING = "Error closing archive: ";
private static final String ERRCLOSINGFLE = "Error closing file: ";
private static final String SUCCESS = "\nCompression operation succeeded\n";
private String filename;
/* String[] array conversion from jython list
* implicit and poses no problems (JDK7) */
private String[] pathsx;
public SevenZipThing(String filename, String[] pathsx) {
this.filename = filename;
this.pathsx = pathsx;
}
/**
* The callback provides information about archive items.
*/
/**
* I copied this straight from the sevenZipJBinding's author's
* code - but I haven't put much in to deal with messaging
* or error handling
* */
private final class MyCreateCallback
implements IOutCreateCallback<IOutItem7z> {
public void setOperationResult(boolean operationResultOk)
throws SevenZipException {
// Track each operation result here
}
public void setTotal(long total) throws SevenZipException {
// Track operation progress here
System.out.print(RETCHAR + String.format(INTFMT, total) +
BYTESTOCOMPRESS);
}
public void setCompleted(long complete) throws SevenZipException {
// Track operation progress here
}
public IOutItem7z getItemInformation(int index,
OutItemFactory<IOutItem7z> outItemFactory) {
IOutItem7z item = outItemFactory.createOutItem();
Path path = Paths.get(pathsx[index]);
item.setPropertyPath(pathsx[index]);
try {
// Java arrays are limited to 2 ** 31 items - small.
byte[] data = Files.readAllBytes(path);
item.setDataSize((long) data.length);
return item;
// XXX - I could do a lot better than this (error handling).
} catch (Exception e) {
System.err.println(ERROCCURS + e);
}
return null;
}
public ISequentialInStream getStream(int i)
throws SevenZipException {
Path path = Paths.get(pathsx[i]);
try {
byte[] data = Files.readAllBytes(path);
System.out.println(COMPRESSFILE + path);
return new ByteArrayStream(data, true);
} catch (Exception e) {
System.err.println(ERROCCURS + e);
}
return null;
}
}
public void compress() {
/* Mostly copied from sevenZipJBinding's author's code -
* I made the compress method public to work from jython.
* Also, I deal with all of the file listing in jython
* and just pass a list to this class. */
boolean success = false;
RandomAccessFile raf = null;
IOutCreateArchive7z outArchive = null;
try {
raf = new RandomAccessFile(filename, RW);
// Open out-archive object
outArchive = SevenZip.openOutArchive7z();
// Configure archive
outArchive.setLevel(LVL);
outArchive.setSolid(true);
// All available processors.
outArchive.setThreadCount(0);
// Create archive
outArchive.createArchive(new RandomAccessFileOutStream(raf),
pathsx.length, new MyCreateCallback());
success = true;
} catch (SevenZipException e) {
System.err.println(SEVZERR);
// Get more information using extended method
e.printStackTraceExtended();
} catch (Exception e) {
System.err.println(ERROCCURS + e);
} finally {
if (outArchive != null) {
try {
outArchive.close();
} catch (IOException e) {
System.err.println(ERRCLOSING + e);
success = false;
}
}
if (raf != null) {
try {
raf.close();
} catch (IOException e) {
System.err.println(ERRCLOSINGFLE + e);
success = false;
}
}
}
if (success) {
System.out.println(SUCCESS);
}
}
}
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.File;
import java.io.OutputStream;
import java.io.FileOutputStream;
import java.io.FileNotFoundException;
import java.util.Arrays;
import java.util.ArrayList;
import net.sf.sevenzipjbinding.IInArchive;
import net.sf.sevenzipjbinding.PropID;
import net.sf.sevenzipjbinding.SevenZip;
import net.sf.sevenzipjbinding.SevenZipException;
import net.sf.sevenzipjbinding.impl.RandomAccessFileInStream;
import net.sf.sevenzipjbinding.IArchiveExtractCallback;
import net.sf.sevenzipjbinding.ExtractOperationResult;
import net.sf.sevenzipjbinding.ExtractAskMode;
import net.sf.sevenzipjbinding.ISequentialOutStream;
/* 7z archive format */
/* SEVEN_ZIP is the one I want */
import net.sf.sevenzipjbinding.ArchiveFormat;
public class SevenZipThingExtract {
private String filename;
private String extractdirectory;
private ArrayList<String> foldersx = null;
private boolean subdirectory = false;
private static final String ERROPENINGFLE = "Error opening file: ";
private static final String ERRWRITINGFLE = "Error writing to file: ";
private static final String EXTERR = "Extraction error";
private static final String INFOFMT = "%9X | %10s | %s";
private static final String RETCHAR = "\n";
private static final String INTFMT = "%,d";
private static final String BYTESTOEXTRACT = " bytes total to extract\n";
private static final String RW = "rw";
private static final String BACKSLASH = "\\";
private static final String SEVZERR = "7z-Error occurs:";
private static final String ERROCCURS = "Error occurs: ";
private static final String ERRCLOSING = "Error closing archive: ";
private static final String ERRCLOSINGFLE = "Error closing file: ";
public SevenZipThingExtract(String filename, String extractdirectory,
boolean subdirectory) {
this.filename = filename;
this.foldersx = new ArrayList<String>();
this.extractdirectory = extractdirectory;
this.subdirectory = subdirectory;
}
private final class MyExtractCallback
implements IArchiveExtractCallback {
// Copied mostly from example.
private int hash = 0;
private int size = 0;
private int index;
private boolean skipExtraction;
private IInArchive inArchive;
private OutputStream outputStream;
private File file;
public MyExtractCallback(IInArchive inArchive) {
this.inArchive = inArchive;
}
@Override
public ISequentialOutStream getStream(int index,
ExtractAskMode extractAskMode)
throws SevenZipException {
this.index = index;
// I'm not skipping anything.
skipExtraction = (Boolean) false;
String path = (String) inArchive.getProperty(index, PropID.PATH);
// Try prepending extractdirectory.
if (subdirectory) {
path = extractdirectory + BACKSLASH + path.substring(2);
} else {
path = extractdirectory + BACKSLASH + path;
}
file = new File(path);
try {
outputStream = new FileOutputStream(file);
} catch (FileNotFoundException e) {
throw new SevenZipException(ERROPENINGFLE
+ file.getAbsolutePath(), e);
}
return new ISequentialOutStream() {
public int write(byte[] data) throws SevenZipException {
try {
outputStream.write(data);
} catch (IOException e) {
throw new SevenZipException(ERRWRITINGFLE
+ file.getAbsolutePath());
}
return data.length; // Return amount of consumed data
}
};
}
public void prepareOperation(ExtractAskMode extractAskMode)
throws SevenZipException {
}
public void setOperationResult(ExtractOperationResult extractOperationResult)
throws SevenZipException {
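// Track each operation result here
// Close the stream opened in getStream so that
// file handles are not leaked.
if (outputStream != null) {
try {
outputStream.close();
} catch (IOException e) {
System.err.println(ERRCLOSINGFLE + e);
}
outputStream = null;
}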
if (extractOperationResult != ExtractOperationResult.OK) {
System.err.println(EXTERR);
} else {
System.out.println(String.format(INFOFMT, hash, size,
inArchive.getProperty(index, PropID.PATH)));
hash = 0;
size = 0;
}
}
public void setTotal(long total) throws SevenZipException {
System.out.print(RETCHAR + String.format(INTFMT, total) +
BYTESTOEXTRACT);
}
public void setCompleted(long complete) throws SevenZipException {
// Track operation progress here
}
}
private final class MyGetPathsCallback
implements IArchiveExtractCallback {
// Copied mostly from example.
private int hash = 0;
private int size = 0;
private int index;
private boolean skipExtraction;
private IInArchive inArchive;
public MyGetPathsCallback(IInArchive inArchive) {
this.inArchive = inArchive;
}
public ISequentialOutStream getStream(int index,
ExtractAskMode extractAskMode)
throws SevenZipException {
this.index = index;
// I'm not skipping anything.
skipExtraction = (Boolean) false;
String path = (String) inArchive.getProperty(index,
PropID.PATH);
foldersx.add(path);
return new ISequentialOutStream() {
public int write(byte[] data) throws SevenZipException {
hash ^= Arrays.hashCode(data);
size += data.length;
// Return amount of processed data
return data.length;
}
};
}
public void prepareOperation(ExtractAskMode extractAskMode)
throws SevenZipException {
}
public void setOperationResult(ExtractOperationResult extractOperationResult)
throws SevenZipException {
// Track each operation result here
if (extractOperationResult != ExtractOperationResult.OK) {
System.err.println(EXTERR);
} else {
System.out.println(String.format(INFOFMT, hash, size,
inArchive.getProperty(index, PropID.PATH)));
hash = 0;
size = 0;
}
}
public void setTotal(long total) throws SevenZipException {
System.out.print(RETCHAR + String.format(INTFMT, total) +
BYTESTOEXTRACT);
}
public void setCompleted(long complete) throws SevenZipException {
// Track operation progress here
}
}
public void extractfiles() {
boolean success = false;
RandomAccessFile raf = null;
IInArchive inArchive = null;
try {
raf = new RandomAccessFile(filename, RW);
inArchive = SevenZip.openInArchive(ArchiveFormat.SEVEN_ZIP,
new RandomAccessFileInStream(raf));
int itemCount = inArchive.getNumberOfItems();
// From StackOverflow - could use IntStream,
// but that's Java 1.8 (using 1.7).
int[] fileindices = new int[itemCount];
for(int k = 0; k < fileindices.length; k++)
fileindices[k] = k;
inArchive.extract(fileindices, false,
new MyExtractCallback(inArchive));
} catch (SevenZipException e) {
System.err.println(SEVZERR);
// Get more information using extended method
e.printStackTraceExtended();
} catch (Exception e) {
System.err.println(ERROCCURS + e);
} finally {
if (inArchive != null) {
try {
inArchive.close();
} catch (IOException e) {
System.err.println(ERRCLOSING + e);
}
}
if (raf != null) {
try {
raf.close();
} catch (IOException e) {
System.err.println(ERRCLOSINGFLE + e);
}
}
}
}
public ArrayList<String> getfolders() {
boolean success = false;
RandomAccessFile raf = null;
IInArchive inArchive = null;
try {
raf = new RandomAccessFile(filename, RW);
inArchive = SevenZip.openInArchive(ArchiveFormat.SEVEN_ZIP,
new RandomAccessFileInStream(raf));
int itemCount = inArchive.getNumberOfItems();
// From StackOverflow - could use IntStream,
// but that's Java 1.8 (using 1.7).
int[] fileindices = new int[itemCount];
for(int k = 0; k < fileindices.length; k++)
fileindices[k] = k;
inArchive.extract(fileindices, false,
new MyGetPathsCallback(inArchive));
} catch (SevenZipException e) {
System.err.println(SEVZERR);
// Get more information using extended method
e.printStackTraceExtended();
} catch (Exception e) {
System.err.println(ERROCCURS + e);
} finally {
if (inArchive != null) {
try {
inArchive.close();
} catch (IOException e) {
System.err.println(ERRCLOSING + e);
}
}
if (raf != null) {
try {
raf.close();
} catch (IOException e) {
System.err.println(ERRCLOSINGFLE + e);
}
}
}
return foldersx;
}
}
The getfolders method in the SevenZipThingExtract class is the extra method that gets the list of folders. As noted in the jython code below, the limitations on the number of bytes and files that can be compressed necessitate splitting larger files into chunks. Also, for my specific use case, I need to extract files to a specific folder and set of subfolders. My methodology is outlined in the comments in the jython code. The good news: if I get run over by a bus and the uncompression part of the program gets lost, people will be able to get the files back with some effort. The bad news: they will be cursing my headstone. You do the best you can.
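For the "get the files back with some effort" scenario, the split-and-rejoin idea boils down to something like the plain Python sketch below. The function names split_file and join_parts and the tiny test file are invented for illustration. It uses the same path.NNN naming as the jython code further down, but it is a simplified variant, not the code I actually run (for one thing, it never writes an empty trailing part).

```python
import os
import tempfile


def split_file(path, chunk_size):
    """
    Split the file at path into numbered parts
    (path.001, path.002, ...) of at most chunk_size bytes.
    Returns the list of part paths in split order.
    """
    parts = []
    number = 1
    with open(path, 'rb') as source:
        while True:
            data = source.read(chunk_size)
            if not data:
                break
            part = '{0:s}.{1:03d}'.format(path, number)
            with open(part, 'wb') as target:
                target.write(data)
            parts.append(part)
            number += 1
    return parts


def join_parts(parts, outpath):
    """
    Glue the parts (already in split order) back
    together into a single file at outpath.
    """
    with open(outpath, 'wb') as target:
        for part in parts:
            with open(part, 'rb') as source:
                target.write(source.read())


workdir = tempfile.mkdtemp()
original = os.path.join(workdir, 'model.dat')
with open(original, 'wb') as f:
    f.write(b'0123456789' * 10)    # 100 bytes
parts = split_file(original, 32)   # 32 + 32 + 32 + 4 bytes
restored = os.path.join(workdir, 'model_restored.dat')
join_parts(parts, restored)
```

The tracker csv in the real program plays the role of the parts list here: it records which original file each numbered piece came from, so the pieces can be glued back together in order after extraction.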
The three jython modules follow. The first one, folderstozip.py, is just constants:
#!java -jar C:\jython-2.7.0\jython.jar
# folderstozip.py
"""
Constants used in compression and
decompression.
"""
FRONTSLASH = '/'
BACKSLASH = '\\'
EMPTY = ''
SAMEFOLDER = './'
SAMEFOLDERWIN = u'.\\'
SPLITFILETRACKER = 'SPLITFILETRACKER.csv'
SPLITFILE = '{0:s}.{1:s}'
UCOMMA = u','
# 3rd party sevenZipJBindings library.
PATH7ZJB = 'C:/MSPROJECTS/EOMReconciliation/2016/03March'
PATH7ZJB += '/Backup/sevenzipjbinding/lib/sevenzipjbinding.jar'
# OS specific 3rd party sevenZipJBindings library.
PATH7ZJBOSSPEC = r'C:/MSPROJECTS/EOMReconciliation/2016/03March'
PATH7ZJBOSSPEC += '/Backup/sevenzipjbinding/lib/sevenzipjbinding-Windows-amd64.jar'
PROGFOLDER = 'C:/MSPROJECTS/EOMReconciliation/2016/03March/Backup'
PROGFOLDER += FRONTSLASH
# Informational messages.
WROTEFILE = 'Wrote file {:s}\n'
SPLITFILEMSG = 'Have now split {0:,d} bytes of file {1:s} into {2:d} {3:,d} chunks.\n'
DONESPLITTING = '\nDone splitting file'
FILESAFTERSPLIT = '\n{:d} files after split'
COMPRESSING = '\nCompressing file {:s} . . .\n'
DELETING = '\nDeleting file {:s} . . .\n'
DELETINGDIR = '\nNow deleting {:s} . . .\n'
# Five digits: room for 100,000 unique
# file names (00000 through 99999).
UNIQUEX = '{0:05d}'
# XXX - multiple file archives limited to
# 10KB - reason unknown - crashes jvm
# with IInStream interface class not
# found.
# XXX - choked on 8700 bytes - try dropping
# this from 9500 to 8500.
MULTFILELIMIT = 8500
HALFLIMIT = MULTFILELIMIT/2
# About 50 splits for a 3GB file.
CHUNK = 2 ** 26
# Path plus split number.
FILEN = r'{0:s}.{1:03d}'
# Path plus basefilename.
FILEB = r'{0:s}{1:s}'
# Read/Write constants.
RB = 'rb'
WB = 'wb'
W = 'w'
# Filename plus split number.
ARCHIVEX = '{0:s}/{1:s}.7z'
# multifile archive
MULTARCHIVEX = '{0:s}/archive{1:03d}.7z'
MULTFILES = '. . . multiple files'
# File categories.
# Size less than HALFLIMIT.
SMALL = 'small'
# Size greater than or equal to HALFLIMIT but
# less than or equal to CHUNK.
MEDIUM = 'medium'
# Larger than CHUNK.
LARGE = 'large'
BASEPATH = 'basepath'
FILES = 'files'
# XXX - this folder has recognizable
# folder names within your domain
# space - mine are open pit mining
# area names.
BASEDIRS = ['Pit-1', 'Pit-2', 'Pit-3']
#!java -jar C:/jython-2.7.0/jython.jar
# sevenzipper.py
"""
Use java 3rd party 7-zip compression
library (sevenZipJBindings) from
jython to 7zip up MineSight project
files.
"""
import folderstozip as fld
# Need to adjust path to get necessary jar imports.
import sys
# Need for os.path
import os
# Original path of file plus split number.
SPLITFILERECORD = '{0:s},{1:03d}'
sys.path.append(fld.PATH7ZJB)
sys.path.append(fld.PATH7ZJBOSSPEC)
# java 7zip library
import SevenZipThing as z7thing
# For copying files to program
# directory and deleting the old
# ones where necessary.
import shutil
# For unique archive names.
import itertools
COUNTERX = itertools.count(0, 1)
def splitfile(originalfilepath, splitfilestrackerfile):
"""
Split file at (string) originalfilepath
into fld.CHUNK sized chunks and indicate
sequence by number in new split file
name.
Return generator of relative file paths
inside project folder.
originalfilepath is the path of the
file that needs to be split into parts.
splitfilestrackerfile is an open file
object used for tracking file splits
for later retrieval.
"""
sizeoffile = os.path.getsize(originalfilepath)
chunks = sizeoffile/fld.CHUNK + 1
# Counter.
i = 1
with open(originalfilepath, fld.RB) as f:
while i < chunks + 1:
with open(fld.FILEN.format(originalfilepath, i), fld.WB) as f2:
f2.write(f.read(fld.CHUNK))
print(fld.WROTEFILE.format(fld.FILEN.format(originalfilepath, i)))
print(fld.SPLITFILEMSG.format(f.tell(), originalfilepath, i, fld.CHUNK))
print >> splitfilestrackerfile, (SPLITFILERECORD.format(originalfilepath, i))
i += 1
print(fld.DONESPLITTING)
print(fld.FILESAFTERSPLIT.format(i - 1))
return (fld.FILEN.format(originalfilepath, x) for x in xrange(1, i))
def movefiles(movefilesx, intermediatepath):
"""
Move files from MineSight project directory
to program directory.
Return a list of base file names for the
moved files.
movefilesx is a generator of file paths.
intermediatepath is a string relative path
between the program folder and the sub-folder
of the MineSight directory (_msresources/06SOLIDS,
for example).
"""
# Move files to that folder.
movedfiles = []
for pathx in movefilesx:
shutil.move(pathx, fld.PROGFOLDER + intermediatepath +
os.path.basename(pathx))
movedfiles.append(intermediatepath + os.path.basename(pathx))
return movedfiles
def copyfiles(copyfilesx, intermediatepath):
"""
Copy files from MineSight project directory
to program directory.
Return a list of base file names for the
copied files.
copyfilesx is a generator of file paths.
intermediatepath is a string relative path
between the program folder and the sub-folder
of the MineSight directory (_msresources/06SOLIDS,
for example).
"""
# Copy files to that folder.
copiedfiles = []
for pathx in copyfilesx:
shutil.copyfile(pathx, fld.PROGFOLDER + intermediatepath +
os.path.basename(pathx))
copiedfiles.append(intermediatepath + os.path.basename(pathx))
return copiedfiles
def compressfilessingle(filestocompress, prefix, basedir):
"""
Compresses files into an archive.
This is for larger files that take up
an entire archive (7z file).
filestocompress is a list of paths of
files to be compressed. These files
reside inside the program directory.
prefix is a string path addition, usually
'./' that allows the function to deal
with relative paths for files that reside
in subfolders.
basedir is the name of the main MineSight
project directory (Fwaulu, for example).
Side effect function.
"""
for pathx in filestocompress:
basename = os.path.split(pathx)[1]
# Need unique name for subfolder files with same names.
uniqueid = fld.UNIQUEX.format(COUNTERX.next())
uniquename = uniqueid + basename
print(fld.COMPRESSING.format(prefix + basename))
archx = z7thing(fld.ARCHIVEX.format(basedir, uniquename),
[prefix + basename])
archx.compress()
def compressfilesmultiple(filestocompress, indexx, basedir):
"""
Compresses files into an archive.
filestocompress is a list of paths of
files to be compressed. These files
reside inside the program directory.
indexx is an integer that gives the
archive a unique name.
basedir is the name of the main MineSight
project directory (Fwaulu, for example).
Side effect function.
"""
print(fld.COMPRESSING.format(fld.MULTFILES))
archx = z7thing(fld.MULTARCHIVEX.format(basedir, indexx),
filestocompress)
archx.compress()
def segregatefiles(directoryx, basefiles):
"""
From a string directory path directoryx
and a list of base file names, returns
a dictionary of lists of files and their
sizes sorted on size and keyed on file
category.
"""
retval = {}
# Add separator to end of directory path.
directoryx += fld.FRONTSLASH
# Get all files in folder and their sizes.
allfiles = [(os.path.getsize(fld.FILEB.format(directoryx, filex)), filex)
for filex in basefiles]
retval[fld.SMALL] = [x for x in allfiles if x[0] < fld.HALFLIMIT]
retval[fld.SMALL].sort()
retval[fld.MEDIUM] = [x for x in allfiles if x[0] >= fld.HALFLIMIT and
x[0] <= fld.CHUNK]
retval[fld.MEDIUM].sort()
retval[fld.LARGE] = [x for x in allfiles if x[0] > fld.CHUNK]
retval[fld.LARGE].sort()
return retval
def deletefiles(movedfiles):
"""
Delete files that have been compressed.
movedfiles is a list of paths of
files that have been moved or copied to
the program directory for compression.
Side effect function.
"""
for pathx in movedfiles:
print(fld.DELETING.format(pathx))
os.remove(pathx)
def getsmallfilegroupings(smallfiles):
"""
Generator function that yields
a list of files whose sum is
less than the program's limit
for bytes to be archived in a
multiple file archive.
smallfiles is a list of two tuples
of (filesize in bytes, file path).
"""
lenx = len(smallfiles)
insidecounter1 = 0
insidecounter2 = 1
sumx = 0
while (insidecounter2 < (lenx + 1)):
sumx = sum(x[0] for x in smallfiles[insidecounter1:insidecounter2])
if sumx > fld.MULTFILELIMIT:
# Back up one.
insidecounter2 -= 1
yield (x[1] for x in smallfiles[insidecounter1:insidecounter2])
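# Reset and advance counters. The file at
# insidecounter2 was not part of the group
# just yielded, so the next group starts there.
sumx = 0
insidecounter1 = insidecounter2
insidecounter2 = insidecounter1 + 1
else:
insidecounter2 += 1
# After the while loop: yield the final group, which
# never exceeded the limit and so was not yielded above.
if insidecounter1 < lenx:
yield (x[1] for x in smallfiles[insidecounter1:])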
def compresslargefiles(largefiles, dirx, prefix, basedir, splitfilestrackerfile):
"""
Deal with compression of files that need to
be split prior to compression.
largefiles is a list of two tuples of file
sizes and names.
dirx is the directory (str) in which the files
are located.
prefix is a string prefix to augment path
identification for compression.
basedir is the name of the main MineSight
project directory (Fwaulu, for example).
splitfilestrackerfile is an open file
object used for tracking file splits
for later retrieval.
Side effect function.
"""
for filex in largefiles:
# Get generator of paths of splits.
splitfiles = splitfile(fld.FILEB.format(dirx, filex[1]),
splitfilestrackerfile)
movedfiles = movefiles(splitfiles, prefix)
compressfilessingle(movedfiles, prefix, basedir)
deletefiles(movedfiles)
def compressmediumfiles(mediumfiles, dirx, prefix, basedir):
"""
Deal with compression of files that need to
be compressed each to its own archive.
mediumfiles is a list of two tuples of file
sizes and paths.
dirx is the directory (str) in which the files
are located.
prefix is a string prefix to augment path
identification for compression.
basedir is the name of the main MineSight
project directory (Fwaulu, for example).
Side effect function.
"""
filestocopy = (dirx + x[1] for x in mediumfiles)
copiedfiles = copyfiles(filestocopy, prefix)
compressfilessingle(copiedfiles, prefix, basedir)
deletefiles(copiedfiles)
def compresssmallfiles(smallfiles, dirx, prefix, indexx, basedir):
"""
Deal with compression of files that can be
compressed in groups.
smallfiles is a list of two tuples of file
sizes and paths.
dirx is the directory (str) in which the files
are located.
prefix is a string prefix to augment path
identification for compression.
indexx is the current index that the 7zip
file counter (ensures unique archive name)
is on.
basedir is the name of the main MineSight
project directory (Fwaulu, for example).
Returns integer for current archive counter
index.
"""
smallgroupings = getsmallfilegroupings(smallfiles)
while True:
try:
grouplittlefiles = smallgroupings.next()
littlefiles = (dirx + x for x in grouplittlefiles)
copiedfiles = copyfiles(littlefiles, prefix)
compressfilesmultiple(copiedfiles, indexx, basedir)
indexx += 1
deletefiles(copiedfiles)
except StopIteration:
break
return indexx
# XXX - hack
def matchbasedir(folderlist):
"""
Get MineSight project folder name
that matches a folder in the path
in question.
folderlist is a list (in order)
of directories in a path.
Returns string.
"""
for folderx in folderlist:
for projx in fld.BASEDIRS:
if projx == folderx:
return folderx
return None
def getbasedir(pathx):
"""
Returns two tuple of strings for
basedir and basefolder (project
directory name and base path under
project directory copied to program
directory).
pathx is the directory path being
processed (str).
"""
# basedir is project name (Fwaulu, for example).
foldernames = pathx.split(fld.FRONTSLASH)
basedir = matchbasedir(foldernames)
# Get directory under project directory.
# _msresources, for example.
idx = foldernames.index(basedir)
# Directory under program directory ./ for MineSight files.
basefolder = fld.SAMEFOLDER + fld.FRONTSLASH.join(foldernames[idx + 1:])
return basedir, basefolder
def dealwithtoplevel(firstdir):
"""
Compress top level files in the
MineSight project directory.
firstdir is the three tuple returned
from the os.walk() generator function.
Returns two tuple of integer smallfile
multifilecounter used for naming
multiple file archives and splitfilestrackerfile,
an open file object for tracking split
files for later reconstruction.
"""
# Top level files.
dirx = firstdir[0] + fld.FRONTSLASH
basedir, basefolder = getbasedir(dirx)
# File to track split files for later glueing back together.
splitfilestrackerfile = open(fld.SAMEFOLDER + basedir + fld.FRONTSLASH +
fld.SPLITFILETRACKER, fld.W)
firstdirfiles = segregatefiles(firstdir[0], firstdir[2])
compresslargefiles(firstdirfiles[fld.LARGE], dirx, fld.EMPTY, basedir,
splitfilestrackerfile)
compressmediumfiles(firstdirfiles[fld.MEDIUM], dirx, fld.EMPTY, basedir)
# This is for keeping track of
# archives with more than one file.
multifilecounter = 1
multifilecounter = compresssmallfiles(firstdirfiles[fld.SMALL], dirx,
fld.EMPTY, multifilecounter, basedir)
return multifilecounter, splitfilestrackerfile
def dealwithlowerleveldirectories(dirs, multifilecounter, splitfilestrackerfile):
"""
Finishes out compression of lower level
folders under top level MineSight project
directory.
dirs is a partially exhausted (one iteration)
os.walk() generator.
multifilecounter is an integer used for
naming multiple file archives.
splitfilestrackerfile is an open file
object used for tracking file splits
for later retrieval.
Returns orphanedfolders, a list of lower level
folders to be deleted at the end of the program
run.
"""
orphanedfolders = []
for dirx in dirs:
# XXX - hack - I hate dealing with Windows paths.
dirn = dirx[0].replace(fld.BACKSLASH, fld.FRONTSLASH)
diry = dirn + fld.FRONTSLASH
basedir, basefolder = getbasedir(diry)
# Create directory in program path.
fauxdir = fld.PROGFOLDER[:-1] + basefolder[1:-1]
os.mkdir(fauxdir)
orphanedfolders.append(fauxdir)
# Skip anything that doesn't have files.
if not dirx[2]:
continue
# Easiest way to do this might be
# to track directories and sort
# files according to size, then
# filter them accordingly.
dirfiles = segregatefiles(dirx[0], dirx[2])
compresslargefiles(dirfiles[fld.LARGE], diry, basefolder,
basedir, splitfilestrackerfile)
compressmediumfiles(dirfiles[fld.MEDIUM], diry, basefolder, basedir)
multifilecounter = compresssmallfiles(dirfiles[fld.SMALL], diry, basefolder,
multifilecounter, basedir)
splitfilestrackerfile.close()
return orphanedfolders
def walkdir(dirx):
"""
Traverse MineSight project directory,
7zipping everything along the way.
dirx is a string for the directory
to traverse.
Side effect function.
"""
dirs = os.walk(dirx)
# OK - os.walk returns generator that
# yields a tuple in the format
# (str path,
# [list of paths for directories under path],
# [list of filenames under path])
# Top level (Fwaulu, for instance).
# These files will not have a path
# prefix of any sort in their respective
# archives.
firstdir = dirs.next()
multifilecounter, splitfilestrackerfile = dealwithtoplevel(firstdir)
# All other files and folders.
orphanedfolders = dealwithlowerleveldirectories(dirs, multifilecounter,
splitfilestrackerfile)
# Delete lower level folders first - this is necessary.
orphanedfolders.reverse()
for orphanx in orphanedfolders:
print(fld.DELETINGDIR.format(orphanx))
os.rmdir(orphanx)
def cyclefolders(folderx):
"""
Wrapper function for compression
of folder folderx (string).
Side effect function.
"""
# 1) Set up empty project directory (ex: Fwaulu)
# in program directory.
# 2) For first set of files, use no prefix for
# 7zip archive storage (filename only).
# 3) Check for size of file.
# 4) If file is bigger than fld.CHUNK, split.
# 5) If file is smaller than fld.CHUNK, but bigger than
# MULTFILELIMIT, compress to one archive.
# 6) If file is smaller than fld.CHUNK, and smaller than
# MULTFILELIMIT, check subsequent files to determine
# files to include in archive. Keep track of file
# index that puts number of bytes over limit.
# 7) Compress multiple files to one archive - index
# archive to ensure unique name.
# 8) For all following sets of files, same process,
# but must prefix paths with SAMEFOLDER and any
# additional folder names.
foldertracker = []
# Make directory folder in program directory
# to hold 7zip files.
zipfolder = getbasedir(folderx)[0]
os.mkdir(zipfolder)
foldertracker.append(zipfolder)
walkdir(folderx)
print('\nDone')
cyclefolders is the overarching wrapper function for the module (compression operation).
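The size rules in the numbered comments of cyclefolders can be boiled down to the sketch below. It reuses the values of MULTFILELIMIT and CHUNK from folderstozip.py, but the functions categorize and group_small are hypothetical stand-ins written for plain Python. The grouping here is a simpler single-pass accumulation, not the slice-and-sum approach in getsmallfilegroupings.

```python
MULTFILELIMIT = 8500
HALFLIMIT = MULTFILELIMIT // 2
CHUNK = 2 ** 26


def categorize(size):
    """
    Return 'small', 'medium' or 'large' for a
    file size in bytes.
    """
    if size < HALFLIMIT:
        return 'small'
    if size <= CHUNK:
        return 'medium'
    return 'large'


def group_small(files, limit=MULTFILELIMIT):
    """
    Generator that yields lists of file names whose
    combined size stays at or under limit.

    files is a list of two tuples of (size, name).
    """
    group = []
    total = 0
    for size, name in files:
        # Close out the current group before it goes over.
        if group and total + size > limit:
            yield group
            group = []
            total = 0
        group.append(name)
        total += size
    # Whatever is left over is the last group.
    if group:
        yield group


smallfiles = [(4000, 'a.msr'), (4000, 'b.msr'),
              (4000, 'c.msr'), (100, 'd.msr')]
groups = list(group_small(smallfiles))
print(groups)
```

Each small file is under HALFLIMIT, so any single file always fits in a group on its own. The only decision is when to close one group and start the next.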
#!java -jar C:\jython2.7.0\jython.jar
# unsevenzipper.py
"""
Use java 3rd party 7-zip compression
library (sevenZipJBindings) from
jython to un-7zip archives.
"""
# Need to adjust path to get necessary jar imports.
# XXX - it might be cleaner to chain imports by using
# the sevenzipper (s7 alias) below to reference
# double imported modules. For development and
# convenience I reimported everything as though
# sevenzipper.py and unsevenzipper.py were separate
# operations.
import sys
import folderstozip as fld
sys.path.append(fld.PATH7ZJB)
sys.path.append(fld.PATH7ZJBOSSPEC)
import os
import sevenzipper as s7
import SevenZipThingExtract
def subdirectoryornot(pathx):
"""
Boolean function that returns
True if string pathx is a
subdirectory of the MineSight
project folder and False if
the files belong directly to
the MineSight project folder.
"""
pathx = pathx.replace(fld.SAMEFOLDERWIN, fld.BACKSLASH)
pathlist = pathx.split(fld.BACKSLASH)
if len(pathlist) > 1:
return True
return False
def getdirectories(dirx):
"""
Get list of lists of directories
in path under project folder
from 7zip archives in project
folder for archives.
Returns two tuple of list and
dictionary indicating which
7z files are same directory
archives and which are archived
subdirectory files.
dirx is a string for the file
path of the directory to
be walked (./Fwaulu for example).
"""
dirs = os.walk(dirx)
# One level, no subfolders.
files = dirs.next()[2]
# Get directories first.
rawpaths = []
subdirornot = {}
for filex in files:
# Skip uncompressed split file tracker.
if filex == fld.SPLITFILETRACKER:
continue
# I don't know if it's a subdirectory or not, so I'll go with False.
s7tx = SevenZipThingExtract(dirx + fld.FRONTSLASH + filex, dirx, False)
folders = list(s7tx.getfolders())
rawpaths.extend(folders)
# All the paths in folders have the same prefix -
# just do one.
subdirornot[filex] = subdirectoryornot(folders[0])
# Get just directories
justdirectories = [pathx.replace(fld.SAMEFOLDERWIN, fld.BACKSLASH).split(fld.BACKSLASH)[1:-1]
for pathx in rawpaths if pathx.split(fld.BACKSLASH)[1:-1]]
justdirectories = set([tuple(x) for x in justdirectories])
justdirectories = list(justdirectories)
justdirectories.sort()
return justdirectories, subdirornot
def makedirectories(dirn):
"""
Create directory paths within archive
project folder to accept uncompressed
files.
Returns subdirornot dictionary.
dirn is a string for the file
path of the directory to
be walked (./Fwaulu for example).
"""
justdirectories, subdirornot = getdirectories(dirn)
maxdepth = max(len(dirx) for dirx in justdirectories)
for x in xrange(0, maxdepth):
justdirectoriesii = set([tuple(dirx[0:x + 1]) for dirx in justdirectories
if len(dirx) >= x + 1])
for diry in justdirectoriesii:
dirw = dirn + fld.FRONTSLASH + fld.FRONTSLASH.join(diry)
os.mkdir(dirw)
return subdirornot
def extractfiles(dirx):
"""
Extract files from 7z files
in project archive folder.
Side effect function.
dirx is a string for the file
path of the directory to
be walked.
"""
subdirornot = makedirectories(dirx)
dirs = os.walk(dirx)
# One level, no subfolders.
files = dirs.next()[2]
for filex in files:
# Skip uncompressed split file tracker.
if filex == fld.SPLITFILETRACKER:
continue
s7tx = SevenZipThingExtract(dirx + fld.FRONTSLASH + filex,
dirx, subdirornot[filex])
s7tx.extractfiles()
def gluetogethersplitfiles(dirx):
"""
Make split up files whole.
Side effect function.
dirx is the folder in which the split
files reside.
"""
# Glue together big files.
# Do this in a very controlling,
# structured way:
# 1) Read the split file tracker csv file.
# 2) Determine the number and names and paths
# of files to be reconstructed and the
# number of parts in each.
# 3) Check that everything is there for
# each file to be reconstructed.
# 4) Get the new relative path.
# 5) Glue back together programmatically.
splitfiles = []
# fld.SPLITFILETRACKER is structured as original path
# of file split, number of file split.
with open(fld.SAMEFOLDERWIN + dirx +
fld.FRONTSLASH + fld.SPLITFILETRACKER, 'r') as f:
for linex in f:
strippedline = [x.strip() for x in linex.split(fld.UCOMMA)]
splitfiles.append(tuple(strippedline))
orignames = [x[0] for x in splitfiles]
splitoriginals = set(orignames)
# Make dictionary that is easy to cycle through.
filesx = {}
for orig in splitoriginals:
basedir, basefolder = s7.getbasedir(orig)
filesx[orig] = {}
filesx[orig][fld.BASEPATH] = fld.SAMEFOLDER + basedir + basefolder[1:]
# Build the list now - a lazy generator would look up orig later.
filesx[orig][fld.FILES] = [fld.SPLITFILE.format(filesx[orig][fld.BASEPATH], filex[1])
for filex in splitfiles if filex[0] == orig]
for orig in filesx:
with open(filesx[orig][fld.BASEPATH], fld.WB) as mainfile:
for filex in filesx[orig][fld.FILES]:
with open(filex, fld.RB) as splitfile:
mainfile.write(splitfile.read())
def restore(dirx):
"""
Restores MineSight project directory
inside program path.
dirx is a string for the directory
to be restored (./Fwaulu, for example).
Side effect function.
"""
extractfiles(dirx)
gluetogethersplitfiles(dirx)
print('Done')
restore is the main function for the module (uncompression).
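The split file tracker that gluetogethersplitfiles reads is a plain two column csv: original path, then split number. A minimal sketch of turning those records into something easy to loop over is below. readtracker is a hypothetical name, and unlike the module above it keeps the original paths rather than remapping them into the program folder.

```python
from collections import OrderedDict


def readtracker(lines):
    """
    Map each original path to the ordered list of its split
    file paths. lines is an iterable of 'path,NNN' records.
    """
    parts = OrderedDict()
    for linex in lines:
        linex = linex.strip()
        # Skip blank lines.
        if not linex:
            continue
        # Split on the last comma only, in case a path contains one.
        origpath, number = [x.strip() for x in linex.rsplit(',', 1)]
        parts.setdefault(origpath, []).append(
            '{0:s}.{1:s}'.format(origpath, number))
    return parts
```

Any iterable of lines works, so an open file object for the tracker can be passed straight in.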
Notes:
1) I don't have admin rights at work and did not have javac (the compiler for java) available. You can download a JDK (Java Development Kit) package from Oracle that has it; a JRE alone does not include the compiler. Without admin rights, you can't install it normally. Still, you can use it. My compilation went something like this:
<path to downloaded JDK>/bin/javac -cp <path to downloaded 7-ZipJBinding>/lib/* <myclassname>.java
2) I've left all the split up files and 7z archives in the folder where I decompress my files and recombine the split files. This takes up a lot of space depending on what you're working with. If space is at a premium, you probably want to write jython code to move or delete the archives after uncompressing them.
3) The most time consuming part of runtime is the compression, uncompression, and splitting and recombining of split files. Porting some of this to java (instead of jython) might speed things up. I code faster and generally better in jython. Also, my objective was control, not speed. YMMV (your mileage may vary) with this approach. There are far better general purpose ones.
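For note 2, a cleanup pass might look something like the sketch below. This is not part of my program. It assumes archives end in .7z and split pieces end in a three digit number (going by the ARCHIVEX and FILEN formats in folderstozip.py), and isleftover and removeleftovers are hypothetical names. It defaults to a dry run because anything else in the folder that ends in three digits would match too.

```python
import os
import re

# Assumption: archives end in .7z and split pieces end in
# a dot plus three digits (big.dat.001, for example).
LEFTOVER = re.compile(r'(\.7z|\.\d{3})$', re.IGNORECASE)


def isleftover(name):
    """True if name looks like a 7z archive or a split piece."""
    return LEFTOVER.search(name) is not None


def removeleftovers(dirx, dryrun=True):
    """
    Delete archives and split pieces sitting directly in dirx.
    Returns the list of paths that were (or would be) deleted.
    """
    removed = []
    for name in sorted(os.listdir(dirx)):
        pathx = os.path.join(dirx, name)
        if os.path.isfile(pathx) and isleftover(name):
            if not dryrun:
                os.remove(pathx)
            removed.append(pathx)
    return removed
```

Run it once with dryrun=True, look over the list it returns, and only then run it with dryrun=False.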
Thanks for stopping by.
7-Zip-JBinding is written using java Interfaces that are structured pretty specifically. I did not venture too far away from the examples given in the 7-Zip-JBinding documentation. I smithed two modules for my own purposes, compressing and uncompressing, and present them (java code) below. The decompression one has a separate method for retrieving paths of the compressed files. This is not efficient, but for what I need to do, and for the limitations of the library and the approach, it works out for the best.
import java.io.IOException;
import java.io.RandomAccessFile;
import net.sf.sevenzipjbinding.IOutCreateArchive7z;
import net.sf.sevenzipjbinding.IOutCreateCallback;
import net.sf.sevenzipjbinding.IOutItem7z;
import net.sf.sevenzipjbinding.ISequentialInStream;
import net.sf.sevenzipjbinding.SevenZip;
import net.sf.sevenzipjbinding.SevenZipException;
import net.sf.sevenzipjbinding.impl.OutItemFactory;
import net.sf.sevenzipjbinding.impl.RandomAccessFileOutStream;
import net.sf.sevenzipjbinding.util.ByteArrayStream;
/* Off StackOverflow - works for getting
* file content/bytes from path */
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.Path;
public class SevenZipThing {
private static final String RETCHAR = "\n";
private static final String INTFMT = "%,d";
private static final String BYTESTOCOMPRESS = " bytes total to compress\n";
private static final String ERROCCURS = "Error occurs: ";
private static final String COMPRESSFILE = "\nCompressing file ";
private static final String RW = "rw";
private static final int LVL = 5;
private static final String SEVZERR = "7z-Error occurs:";
private static final String ERRCLOSING = "Error closing archive: ";
private static final String ERRCLOSINGFLE = "Error closing file: ";
private static final String SUCCESS = "\nCompression operation succeeded\n";
private String filename;
/* String[] array conversion from jython list
* is implicit and poses no problems (JDK7) */
private String[] pathsx;
public SevenZipThing(String filename, String[] pathsx) {
this.filename = filename;
this.pathsx = pathsx;
}
/**
* The callback provides information about archive items.
*/
/**
* I copied this straight from the sevenZipJBinding's author's
* code - but I haven't put much in to deal with messaging
* or error handling
* */
private final class MyCreateCallback
implements IOutCreateCallback<IOutItem7z> {
public void setOperationResult(boolean operationResultOk)
throws SevenZipException {
// Track each operation result here
}
public void setTotal(long total) throws SevenZipException {
// Track operation progress here
System.out.print(RETCHAR + String.format(INTFMT, total) +
BYTESTOCOMPRESS);
}
public void setCompleted(long complete) throws SevenZipException {
// Track operation progress here
}
public IOutItem7z getItemInformation(int index,
OutItemFactory<IOutItem7z> outItemFactory) {
IOutItem7z item = outItemFactory.createOutItem();
Path path = Paths.get(pathsx[index]);
item.setPropertyPath(pathsx[index]);
try {
// Java arrays are limited to 2 ** 31 items - small.
byte[] data = Files.readAllBytes(path);
item.setDataSize((long) data.length);
return item;
// XXX - I could do a lot better than this (error handling).
} catch (Exception e) {
System.err.println(ERROCCURS + e);
}
return null;
}
public ISequentialInStream getStream(int i)
throws SevenZipException {
Path path = Paths.get(pathsx[i]);
try {
byte[] data = Files.readAllBytes(path);
System.out.println(COMPRESSFILE + path);
return new ByteArrayStream(data, true);
} catch (Exception e) {
System.err.println(ERROCCURS + e);
}
return null;
}
}
public void compress() {
/* Mostly copied from sevenZipJBinding's author's code -
* I made the compress method public to work from jython.
* Also, I deal with all of the file listing in jython
* and just pass a list to this class. */
boolean success = false;
RandomAccessFile raf = null;
IOutCreateArchive7z outArchive = null;
try {
raf = new RandomAccessFile(filename, RW);
// Open out-archive object
outArchive = SevenZip.openOutArchive7z();
// Configure archive
outArchive.setLevel(LVL);
outArchive.setSolid(true);
// All available processors.
outArchive.setThreadCount(0);
// Create archive
outArchive.createArchive(new RandomAccessFileOutStream(raf),
pathsx.length, new MyCreateCallback());
success = true;
} catch (SevenZipException e) {
System.err.println(SEVZERR);
// Get more information using extended method
e.printStackTraceExtended();
} catch (Exception e) {
System.err.println(ERROCCURS + e);
} finally {
if (outArchive != null) {
try {
outArchive.close();
} catch (IOException e) {
System.err.println(ERRCLOSING + e);
success = false;
}
}
if (raf != null) {
try {
raf.close();
} catch (IOException e) {
System.err.println(ERRCLOSINGFLE + e);
success = false;
}
}
}
if (success) {
System.out.println(SUCCESS);
}
}
}
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.File;
import java.io.OutputStream;
import java.io.FileOutputStream;
import java.io.FileNotFoundException;
import java.util.Arrays;
import java.util.ArrayList;
import net.sf.sevenzipjbinding.IInArchive;
import net.sf.sevenzipjbinding.PropID;
import net.sf.sevenzipjbinding.SevenZip;
import net.sf.sevenzipjbinding.SevenZipException;
import net.sf.sevenzipjbinding.impl.RandomAccessFileInStream;
import net.sf.sevenzipjbinding.IArchiveExtractCallback;
import net.sf.sevenzipjbinding.ExtractOperationResult;
import net.sf.sevenzipjbinding.ExtractAskMode;
import net.sf.sevenzipjbinding.ISequentialOutStream;
/* 7z archive format */
/* SEVEN_ZIP is the one I want */
import net.sf.sevenzipjbinding.ArchiveFormat;
public class SevenZipThingExtract {
private String filename;
private String extractdirectory;
private ArrayList<String> foldersx = null;
private boolean subdirectory = false;
private static final String ERROPENINGFLE = "Error opening file: ";
private static final String ERRWRITINGFLE = "Error writing to file: ";
private static final String EXTERR = "Extraction error";
private static final String INFOFMT = "%9X | %10s | %s";
private static final String RETCHAR = "\n";
private static final String INTFMT = "%,d";
private static final String BYTESTOEXTRACT = " bytes total to extract\n";
private static final String RW = "rw";
private static final String BACKSLASH = "\\";
private static final String SEVZERR = "7z-Error occurs:";
private static final String ERROCCURS = "Error occurs: ";
private static final String ERRCLOSING = "Error closing archive: ";
private static final String ERRCLOSINGFLE = "Error closing file: ";
public SevenZipThingExtract(String filename, String extractdirectory,
boolean subdirectory) {
this.filename = filename;
this.foldersx = new ArrayList<String>();
this.extractdirectory = extractdirectory;
this.subdirectory = subdirectory;
}
private final class MyExtractCallback
implements IArchiveExtractCallback {
// Copied mostly from example.
private int hash = 0;
private int size = 0;
private int index;
private boolean skipExtraction;
private IInArchive inArchive;
private OutputStream outputStream;
private File file;
public MyExtractCallback(IInArchive inArchive) {
this.inArchive = inArchive;
}
@Override
public ISequentialOutStream getStream(int index,
ExtractAskMode extractAskMode)
throws SevenZipException {
this.index = index;
// I'm not skipping anything.
skipExtraction = (Boolean) false;
String path = (String) inArchive.getProperty(index, PropID.PATH);
// Try prepending extractdirectory.
if (subdirectory) {
path = extractdirectory + BACKSLASH + path.substring(2);
} else {
path = extractdirectory + BACKSLASH + path;
}
file = new File(path);
try {
outputStream = new FileOutputStream(file);
} catch (FileNotFoundException e) {
throw new SevenZipException(ERROPENINGFLE
+ file.getAbsolutePath(), e);
}
return new ISequentialOutStream() {
public int write(byte[] data) throws SevenZipException {
try {
outputStream.write(data);
} catch (IOException e) {
throw new SevenZipException(ERRWRITINGFLE
+ file.getAbsolutePath(), e);
}
return data.length; // Return amount of consumed data
}
};
}
public void prepareOperation(ExtractAskMode extractAskMode)
throws SevenZipException {
}
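public void setOperationResult(ExtractOperationResult extractOperationResult)
throws SevenZipException {
// Close the file opened in getStream so its handle is released.
if (outputStream != null) {
try {
outputStream.close();
} catch (IOException e) {
throw new SevenZipException(ERRCLOSINGFLE
+ file.getAbsolutePath(), e);
} finally {
outputStream = null;
}
}
// Track each operation result here
if (extractOperationResult != ExtractOperationResult.OK) {
System.err.println(EXTERR);
} else {
System.out.println(String.format(INFOFMT, hash, size,
inArchive.getProperty(index, PropID.PATH)));
hash = 0;
size = 0;
}
}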
public void setTotal(long total) throws SevenZipException {
System.out.print(RETCHAR + String.format(INTFMT, total) +
BYTESTOEXTRACT);
}
public void setCompleted(long complete) throws SevenZipException {
// Track operation progress here
}
}
private final class MyGetPathsCallback
implements IArchiveExtractCallback {
// Copied mostly from example.
private int hash = 0;
private int size = 0;
private int index;
private boolean skipExtraction;
private IInArchive inArchive;
public MyGetPathsCallback(IInArchive inArchive) {
this.inArchive = inArchive;
}
public ISequentialOutStream getStream(int index,
ExtractAskMode extractAskMode)
throws SevenZipException {
this.index = index;
// I'm not skipping anything.
skipExtraction = (Boolean) false;
String path = (String) inArchive.getProperty(index,
PropID.PATH);
foldersx.add(path);
return new ISequentialOutStream() {
public int write(byte[] data) throws SevenZipException {
hash ^= Arrays.hashCode(data);
size += data.length;
// Return amount of processed data
return data.length;
}
};
}
public void prepareOperation(ExtractAskMode extractAskMode)
throws SevenZipException {
}
public void setOperationResult(ExtractOperationResult extractOperationResult)
throws SevenZipException {
// Track each operation result here
if (extractOperationResult != ExtractOperationResult.OK) {
System.err.println(EXTERR);
} else {
System.out.println(String.format(INFOFMT, hash, size,
inArchive.getProperty(index, PropID.PATH)));
hash = 0;
size = 0;
}
}
public void setTotal(long total) throws SevenZipException {
System.out.print(RETCHAR + String.format(INTFMT, total) +
BYTESTOEXTRACT);
}
public void setCompleted(long complete) throws SevenZipException {
// Track operation progress here
}
}
public void extractfiles() {
boolean success = false;
RandomAccessFile raf = null;
IInArchive inArchive = null;
try {
raf = new RandomAccessFile(filename, RW);
inArchive = SevenZip.openInArchive(ArchiveFormat.SEVEN_ZIP,
new RandomAccessFileInStream(raf));
int itemCount = inArchive.getNumberOfItems();
// From StackOverflow - could use IntStream,
// but that's Java 1.8 (using 1.7).
int[] fileindices = new int[itemCount];
for(int k = 0; k < fileindices.length; k++)
fileindices[k] = k;
inArchive.extract(fileindices, false,
new MyExtractCallback(inArchive));
} catch (SevenZipException e) {
System.err.println(SEVZERR);
// Get more information using extended method
e.printStackTraceExtended();
} catch (Exception e) {
System.err.println(ERROCCURS + e);
} finally {
if (inArchive != null) {
try {
inArchive.close();
} catch (IOException e) {
System.err.println(ERRCLOSING + e);
}
}
if (raf != null) {
try {
raf.close();
} catch (IOException e) {
System.err.println(ERRCLOSINGFLE + e);
}
}
}
}
public ArrayList<String> getfolders() {
boolean success = false;
RandomAccessFile raf = null;
IInArchive inArchive = null;
try {
raf = new RandomAccessFile(filename, RW);
inArchive = SevenZip.openInArchive(ArchiveFormat.SEVEN_ZIP,
new RandomAccessFileInStream(raf));
int itemCount = inArchive.getNumberOfItems();
// From StackOverflow - could use IntStream,
// but that's Java 1.8 (using 1.7).
int[] fileindices = new int[itemCount];
for(int k = 0; k < fileindices.length; k++)
fileindices[k] = k;
inArchive.extract(fileindices, false,
new MyGetPathsCallback(inArchive));
} catch (SevenZipException e) {
System.err.println(SEVZERR);
// Get more information using extended method
e.printStackTraceExtended();
} catch (Exception e) {
System.err.println(ERROCCURS + e);
} finally {
if (inArchive != null) {
try {
inArchive.close();
} catch (IOException e) {
System.err.println(ERRCLOSING + e);
}
}
if (raf != null) {
try {
raf.close();
} catch (IOException e) {
System.err.println(ERRCLOSINGFLE + e);
}
}
}
return foldersx;
}
}
The method getfolders in the SevenZipThingExtract class is the extra method to get the list of folders. As noted in the jython code, the limitations on the number of bytes and files to be compressed necessitate splitting larger files into chunks. Also, for my specific use case, I need to extract files to a specific folder and set of subfolders. My methodology is outlined in the comments in the jython code. The good news: if I get run over by a bus and the uncompression part of the program gets lost, people will be able to get the files back with some effort. The bad news: they will be cursing my headstone. You do the best you can.
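Stripped of the project-specific path handling, the split and recombine idea is small. The sketch below is plain Python, not my jython module: CHUNK is a tiny stand-in for the real chunk size, and splitfile here, gluefile and roundtrip are hypothetical helpers written only to show that the pieces glue back into the original bytes.

```python
import os
import shutil
import tempfile

# Tiny stand-in for fld.CHUNK (2 ** 26 in the real module).
CHUNK = 16


def splitfile(path, chunk=CHUNK):
    """Split path into numbered pieces (path.001, path.002, ...)."""
    parts = []
    with open(path, 'rb') as source:
        while True:
            data = source.read(chunk)
            if not data:
                break
            partpath = '{0:s}.{1:03d}'.format(path, len(parts) + 1)
            with open(partpath, 'wb') as part:
                part.write(data)
            parts.append(partpath)
    return parts


def gluefile(parts, destination):
    """Concatenate the pieces, in order, into destination."""
    with open(destination, 'wb') as whole:
        for partpath in parts:
            with open(partpath, 'rb') as part:
                whole.write(part.read())


def roundtrip(payload, chunk=CHUNK):
    """Split and reglue payload; return (number of pieces, restored bytes)."""
    workdir = tempfile.mkdtemp()
    try:
        original = os.path.join(workdir, 'big.bin')
        with open(original, 'wb') as f:
            f.write(payload)
        parts = splitfile(original, chunk)
        restored = os.path.join(workdir, 'restored.bin')
        gluefile(parts, restored)
        with open(restored, 'rb') as f:
            return len(parts), f.read()
    finally:
        # Clean up the scratch folder either way.
        shutil.rmtree(workdir)
```

The pieces carry their sequence in the file name, so anyone with the pieces and a way to concatenate files in order can rebuild the original even without the program.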
There are three jython modules. The first one, folderstozip.py, is just constants:
#!java -jar C:\jython-2.7.0\jython.jar
# folderstozip.py
"""
Constants used in compression and
decompression.
"""
FRONTSLASH = '/'
BACKSLASH = '\\'
EMPTY = ''
SAMEFOLDER = './'
SAMEFOLDERWIN = u'.\\'
SPLITFILETRACKER = 'SPLITFILETRACKER.csv'
SPLITFILE = '{0:s}.{1:s}'
UCOMMA = u','
# 3rd party sevenZipJBindings library.
PATH7ZJB = 'C:/MSPROJECTS/EOMReconciliation/2016/03March'
PATH7ZJB += '/Backup/sevenzipjbinding/lib/sevenzipjbinding.jar'
# OS specific 3rd party sevenZipJBindings library.
PATH7ZJBOSSPEC = r'C:/MSPROJECTS/EOMReconciliation/2016/03March'
PATH7ZJBOSSPEC += '/Backup/sevenzipjbinding/lib/sevenzipjbinding-Windows-amd64.jar'
PROGFOLDER = 'C:/MSPROJECTS/EOMReconciliation/2016/03March/Backup'
PROGFOLDER += FRONTSLASH
# Informational messages.
WROTEFILE = 'Wrote file {:s}\n'
SPLITFILEMSG = 'Have now split {0:,d} bytes of file {1:s} into {2:d} {3:,d} chunks.\n'
DONESPLITTING = '\nDone splitting file'
FILESAFTERSPLIT = '\n{:d} files after split'
COMPRESSING = '\nCompressing file {:s} . . .\n'
DELETING = '\nDeleting file {:s} . . .\n'
DELETINGDIR = '\nNow deleting {:s} . . .\n'
# Room for 100000 file names (00000 through 99999).
UNIQUEX = '{0:05d}'
# XXX - multiple file archives limited to
# 10KB - reason unknown - crashes jvm
# with IInStream interface class not
# found.
# XXX - choked on 8700 bytes - try dropping
# this from 9500 to 8500.
MULTFILELIMIT = 8500
HALFLIMIT = MULTFILELIMIT/2
# About 50 splits for a 3GB file.
CHUNK = 2 ** 26
# Path plus split number.
FILEN = r'{0:s}.{1:03d}'
# Path plus basefilename.
FILEB = r'{0:s}{1:s}'
# Read/Write constants.
RB = 'rb'
WB = 'wb'
W = 'w'
# Base directory plus unique archive name.
ARCHIVEX = '{0:s}/{1:s}.7z'
# multifile archive
MULTARCHIVEX = '{0:s}/archive{1:03d}.7z'
MULTFILES = '. . . multiple files'
# File categories.
# Size less than HALFLIMIT.
SMALL = 'small'
# Size greater than or equal to HALFLIMIT but
# less than or equal to CHUNK.
MEDIUM = 'medium'
# Larger than CHUNK.
LARGE = 'large'
BASEPATH = 'basepath'
FILES = 'files'
# XXX - this folder has recognizable
# folder names within your domain
# space - mine are open pit mining
# area names.
BASEDIRS = ['Pit-1', 'Pit-2', 'Pit-3']
#!java -jar C:/jython-2.7.0/jython.jar
# sevenzipper.py
"""
Use java 3rd party 7-zip compression
library (sevenZipJBindings) from
jython to 7zip up MineSight project
files.
"""
import folderstozip as fld
# Need to adjust path to get necessary jar imports.
import sys
# Need for os.path
import os
# Original path of file plus split number.
SPLITFILERECORD = '{0:s},{1:03d}'
sys.path.append(fld.PATH7ZJB)
sys.path.append(fld.PATH7ZJBOSSPEC)
# java 7zip library
import SevenZipThing as z7thing
# For copying files to program
# directory and deleting the old
# ones where necessary.
import shutil
# For unique archive names.
import itertools
COUNTERX = itertools.count(0, 1)
def splitfile(originalfilepath, splitfilestrackerfile):
"""
Split file at (string) originalfilepath
into fld.CHUNK sized chunks and indicate
sequence by number in new split file
name.
Return generator of relative file paths
inside project folder.
originalfilepath is the path of the
file that needs to be split into parts.
splitfilestrackerfile is an open file
object used for tracking file splits
for later retrieval.
"""
sizeoffile = os.path.getsize(originalfilepath)
chunks = sizeoffile/fld.CHUNK + 1
# Counter.
i = 1
with open(originalfilepath, fld.RB) as f:
while i < chunks + 1:
with open(fld.FILEN.format(originalfilepath, i), fld.WB) as f2:
f2.write(f.read(fld.CHUNK))
print(fld.WROTEFILE.format(fld.FILEN.format(originalfilepath, i)))
print(fld.SPLITFILEMSG.format(f.tell(), originalfilepath, i, fld.CHUNK))
print >> splitfilestrackerfile, (SPLITFILERECORD.format(originalfilepath, i))
i += 1
print(fld.DONESPLITTING)
print(fld.FILESAFTERSPLIT.format(i - 1))
return (fld.FILEN.format(originalfilepath, x) for x in xrange(1, i))
def movefiles(movefilesx, intermediatepath):
"""
Move files from MineSight project directory
to program directory.
Return a list of base file names for the
moved files.
movefilesx is a generator of file paths.
intermediatepath is a string relative path
between the program folder and the sub-folder
of the MineSight directory (_msresources/06SOLIDS,
for example).
"""
# Move files to that folder.
movedfiles = []
for pathx in movefilesx:
shutil.move(pathx, fld.PROGFOLDER + intermediatepath +
os.path.basename(pathx))
movedfiles.append(intermediatepath + os.path.basename(pathx))
return movedfiles
def copyfiles(copyfilesx, intermediatepath):
"""
Copy files from MineSight project directory
to program directory.
Return a list of base file names for the
copied files.
copyfilesx is a generator of file paths.
intermediatepath is a string relative path
between the program folder and the sub-folder
of the MineSight directory (_msresources/06SOLIDS,
for example).
"""
# Copy files to that folder.
copiedfiles = []
for pathx in copyfilesx:
shutil.copyfile(pathx, fld.PROGFOLDER + intermediatepath +
os.path.basename(pathx))
copiedfiles.append(intermediatepath + os.path.basename(pathx))
return copiedfiles
def compressfilessingle(filestocompress, prefix, basedir):
"""
Compresses files into an archive.
This is for larger files that take up
an entire archive (7z file).
filestocompress is a list of paths of
files to be compressed. These files
reside inside the program directory.
prefix is a string path addition, usually
'./' that allows the function to deal
with relative paths for files that reside
in subfolders.
basedir is the name of the main MineSight
project directory (Fwaulu, for example).
Side effect function.
"""
for pathx in filestocompress:
basename = os.path.split(pathx)[1]
# Need unique name for subfolder files with same names.
uniqueid = fld.UNIQUEX.format(COUNTERX.next())
uniquename = uniqueid + basename
print(fld.COMPRESSING.format(prefix + basename))
archx = z7thing(fld.ARCHIVEX.format(basedir, uniquename),
[prefix + basename])
archx.compress()
def compressfilesmultiple(filestocompress, indexx, basedir):
"""
Compresses files into an archive.
filestocompress is a list of paths of
files to be compressed. These files
reside inside the program directory.
indexx is an integer that gives the
archive a unique name.
basedir is the name of the main MineSight
project directory (Fwaulu, for example).
Side effect function.
"""
print(fld.COMPRESSING.format(fld.MULTFILES))
archx = z7thing(fld.MULTARCHIVEX.format(basedir, indexx),
filestocompress)
archx.compress()
def segregatefiles(directoryx, basefiles):
"""
From a string directory path directoryx
and a list of base file names, returns
a dictionary of lists of files and their
sizes sorted on size and keyed on file
category.
"""
retval = {}
# Add separator to end of directory path.
directoryx += fld.FRONTSLASH
# Get all files in folder and their sizes.
allfiles = [(os.path.getsize(fld.FILEB.format(directoryx, filex)), filex)
for filex in basefiles]
retval[fld.SMALL] = [x for x in allfiles if x[0] < fld.HALFLIMIT]
retval[fld.SMALL].sort()
retval[fld.MEDIUM] = [x for x in allfiles if x[0] >= fld.HALFLIMIT and
x[0] <= fld.CHUNK]
retval[fld.MEDIUM].sort()
retval[fld.LARGE] = [x for x in allfiles if x[0] > fld.CHUNK]
retval[fld.LARGE].sort()
return retval
def deletefiles(movedfiles):
"""
Delete files that have been compressed.
movedfiles is a list of paths of
files that have been moved or copied to
the program directory for compression.
Side effect function.
"""
for pathx in movedfiles:
print(fld.DELETING.format(pathx))
os.remove(pathx)
def getsmallfilegroupings(smallfiles):
    """
    Generator function that yields
    groups of file paths whose sizes sum
    to no more than the program's limit
    for bytes to be archived in a
    multiple file archive.
    smallfiles is a list of two tuples
    of (filesize in bytes, file path).
    """
    lenx = len(smallfiles)
    insidecounter1 = 0
    insidecounter2 = 1
    while insidecounter2 < (lenx + 1):
        sumx = sum(x[0] for x in smallfiles[insidecounter1:insidecounter2])
        # Never back up past a single file - that would loop forever.
        if sumx > fld.MULTFILELIMIT and insidecounter2 - insidecounter1 > 1:
            # Back up one.
            insidecounter2 -= 1
            yield (x[1] for x in smallfiles[insidecounter1:insidecounter2])
            # Start the next group at the file that
            # pushed the total over the limit.
            insidecounter1 = insidecounter2
            insidecounter2 = insidecounter1 + 1
        else:
            insidecounter2 += 1
    # Yield whatever is left over as the last group.
    if insidecounter1 < lenx:
        yield (x[1] for x in smallfiles[insidecounter1:])
def compresslargefiles(largefiles, dirx, prefix, basedir, splitfilestrackerfile):
"""
Deal with compression of files that need to
be split prior to compression.
largefiles is a list of two tuples of file
sizes and names.
dirx is the directory (str) in which the files
are located.
prefix is a string prefix to augment path
identification for compression.
basedir is the name of the main MineSight
project directory (Fwaulu, for example).
splitfilestrackerfile is an open file
object used for tracking file splits
for later retrieval.
Side effect function.
"""
for filex in largefiles:
# Get generator of paths of splits.
splitfiles = splitfile(fld.FILEB.format(dirx, filex[1]),
splitfilestrackerfile)
movedfiles = movefiles(splitfiles, prefix)
compressfilessingle(movedfiles, prefix, basedir)
deletefiles(movedfiles)
def compressmediumfiles(mediumfiles, dirx, prefix, basedir):
"""
Deal with compression of files that need to
be compressed each to its own archive.
mediumfiles is a list of two tuples of file
sizes and paths.
dirx is the directory (str) in which the files
are located.
prefix is a string prefix to augment path
identification for compression.
basedir is the name of the main MineSight
project directory (Fwaulu, for example).
Side effect function.
"""
filestocopy = (dirx + x[1] for x in mediumfiles)
copiedfiles = copyfiles(filestocopy, prefix)
compressfilessingle(copiedfiles, prefix, basedir)
deletefiles(copiedfiles)
def compresssmallfiles(smallfiles, dirx, prefix, indexx, basedir):
"""
Deal with compression of files that can be
compressed in groups.
smallfiles is a list of two tuples of file
sizes and paths.
dirx is the directory (str) in which the files
are located.
prefix is a string prefix to augment path
identification for compression.
indexx is the current index that the 7zip
file counter (ensures unique archive name)
is on.
basedir is the name of the main MineSight
project directory (Fwaulu, for example).
Returns integer for current archive counter
index.
"""
smallgroupings = getsmallfilegroupings(smallfiles)
while True:
try:
grouplittlefiles = smallgroupings.next()
littlefiles = (dirx + x for x in grouplittlefiles)
copiedfiles = copyfiles(littlefiles, prefix)
compressfilesmultiple(copiedfiles, indexx, basedir)
indexx += 1
deletefiles(copiedfiles)
except StopIteration:
break
return indexx
# XXX - hack
def matchbasedir(folderlist):
"""
Get MineSight project folder name
that matches a folder in the path
in question.
folderlist is a list (in order)
of directories in a path.
Returns string.
"""
for folderx in folderlist:
for projx in fld.BASEDIRS:
if projx == folderx:
return folderx
return None
def getbasedir(pathx):
"""
Returns two tuple of strings for
basedir and basefolder (project
directory name and base path under
project directory copied to program
directory).
pathx is the directory path being
processed (str).
"""
# basedir is project name (Fwaulu, for example).
foldernames = pathx.split(fld.FRONTSLASH)
basedir = matchbasedir(foldernames)
# Get directory under project directory.
# _msresources, for example.
idx = foldernames.index(basedir)
# Directory under program directory ./ for MineSight files.
basefolder = fld.SAMEFOLDER + fld.FRONTSLASH.join(foldernames[idx + 1:])
return basedir, basefolder
def dealwithtoplevel(firstdir):
"""
Compress top level files in the
MineSight project directory.
firstdir is the three tuple returned
from the os.walk() generator function.
Returns two tuple of integer smallfile
multifilecounter used for naming
multiple file archives and splitfilestrackerfile,
an open file object for tracking split
files for later reconstruction.
"""
# Top level files.
dirx = firstdir[0] + fld.FRONTSLASH
basedir, basefolder = getbasedir(dirx)
# File to track split files for later glueing back together.
splitfilestrackerfile = open(fld.SAMEFOLDER + basedir + fld.FRONTSLASH +
fld.SPLITFILETRACKER, fld.W)
firstdirfiles = segregatefiles(firstdir[0], firstdir[2])
compresslargefiles(firstdirfiles[fld.LARGE], dirx, fld.EMPTY, basedir,
splitfilestrackerfile)
compressmediumfiles(firstdirfiles[fld.MEDIUM], dirx, fld.EMPTY, basedir)
# This is for keeping track of
# archives with more than one file.
multifilecounter = 1
multifilecounter = compresssmallfiles(firstdirfiles[fld.SMALL], dirx,
fld.EMPTY, multifilecounter, basedir)
return multifilecounter, splitfilestrackerfile
def dealwithlowerleveldirectories(dirs, multifilecounter, splitfilestrackerfile):
"""
Finishes out compression of lower level
folders under top level MineSight project
directory.
dirs is a partially exhausted (one iteration)
os.walk() generator.
multifilecounter is an integer used for
naming multiple file archives.
splitfilestrackerfile is an open file
object used for tracking file splits
for later retrieval.
Returns orphanedfolders, a list of lower level
folders to be deleted at the end of the program
run.
"""
orphanedfolders = []
for dirx in dirs:
# XXX - hack - I hate dealing with Windows paths.
dirn = dirx[0].replace(fld.BACKSLASH, fld.FRONTSLASH)
diry = dirn + fld.FRONTSLASH
basedir, basefolder = getbasedir(diry)
# Create directory in program path.
fauxdir = fld.PROGFOLDER[:-1] + basefolder[1:-1]
os.mkdir(fauxdir)
orphanedfolders.append(fauxdir)
# Skip anything that doesn't have files.
if not dirx[2]:
continue
# Easiest way to do this might be
# to track directories and sort
# files according to size, then
# filter them accordingly.
dirfiles = segregatefiles(dirx[0], dirx[2])
compresslargefiles(dirfiles[fld.LARGE], diry, basefolder,
basedir, splitfilestrackerfile)
compressmediumfiles(dirfiles[fld.MEDIUM], diry, basefolder, basedir)
multifilecounter = compresssmallfiles(dirfiles[fld.SMALL], diry, basefolder,
multifilecounter, basedir)
splitfilestrackerfile.close()
return orphanedfolders
def walkdir(dirx):
    """
    Traverse MineSight project directory,
    7zipping everything along the way.
    dirx is a string for the directory
    to traverse.
    Side effect function.
    """
    dirs = os.walk(dirx)
    # OK - os.walk returns generator that
    # yields a tuple in the format
    #     (str path,
    #      [list of paths for directories under path],
    #      [list of filenames under path])
    # Top level (Fwaulu, for instance).
    # These files will not have a path
    # prefix of any sort in their respective
    # archives.
    firstdir = dirs.next()
    multifilecounter, splitfilestrackerfile = dealwithtoplevel(firstdir)
    # All other files and folders.
    orphanedfolders = dealwithlowerleveldirectories(dirs, multifilecounter,
                                                    splitfilestrackerfile)
    # Delete lower level folders first - this is necessary.
    orphanedfolders.reverse()
    for orphanx in orphanedfolders:
        print(fld.DELETINGDIR.format(orphanx))
        os.rmdir(orphanx)
def cyclefolders(folderx):
    """
    Wrapper function for compression
    of folder folderx (string).
    Side effect function.
    """
    # 1) Set up empty project directory (ex: Fwaulu)
    #    in program directory.
    # 2) For first set of files, use no prefix for
    #    7zip archive storage (filename only).
    # 3) Check for size of file.
    # 4) If file is bigger than fld.CHUNK, split.
    # 5) If file is smaller than fld.CHUNK, but bigger than
    #    MULTFILELIMIT, compress to one archive.
    # 6) If file is smaller than fld.CHUNK, and smaller than
    #    MULTFILELIMIT, check subsequent files to determine
    #    files to include in archive. Keep track of file
    #    index that puts number of bytes over limit.
    # 7) Compress multiple files to one archive - index
    #    archive to ensure unique name.
    # 8) For all following sets of files, same process,
    #    but must prefix paths with SAMEFOLDER and any
    #    additional folder names.
    foldertracker = []
    # Make directory folder in program directory
    # to hold 7zip files.
    zipfolder = getbasedir(folderx)[0]
    os.mkdir(zipfolder)
    foldertracker.append(zipfolder)
    walkdir(folderx)
    print('\nDone')
cyclefolders is the overarching wrapper function for the module (compression operation).
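The size rules in the cyclefolders comments (steps 3 through 7) can be sketched in plain Python. This is only an illustration, not the module's actual segregatefiles or compresssmallfiles. The names CHUNK, MULTFILELIMIT, classify and batchsmallfiles, and the byte limits, are made up for the example. It also assumes a batch is closed before the file that would push it over the limit, which may not match what the real code does.

```python
# Hypothetical limits - the real values live in folderstozip (fld).
CHUNK = 100 * 1024 * 1024          # split anything bigger than this
MULTFILELIMIT = 10 * 1024 * 1024   # batch anything at or below this

LARGE, MEDIUM, SMALL = 'large', 'medium', 'small'

def classify(sizeinbytes):
    """
    Return the size bucket for a file
    of sizeinbytes bytes.
    """
    if sizeinbytes > CHUNK:
        return LARGE
    if sizeinbytes > MULTFILELIMIT:
        return MEDIUM
    return SMALL

def batchsmallfiles(files, limit=MULTFILELIMIT):
    """
    Group (name, size) pairs into batches
    for multiple file archives.
    Assumption: a new batch starts when adding
    the next file would exceed limit.
    Returns a list of lists of names.
    """
    batches = []
    current = []
    total = 0
    for name, size in files:
        if current and total + size > limit:
            batches.append(current)
            current = []
            total = 0
        current.append(name)
        total += size
    # Whatever is left over is the last batch.
    if current:
        batches.append(current)
    return batches
```

Each batch would then go into one archive, with an index like multifilecounter keeping the archive names unique.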
#!java -jar C:\jython2.7.0\jython.jar
# unsevenzipper.py
"""
Use java 3rd party 7-zip compression
library (sevenZipJBindings) from
jython to un-7zip archives.
"""
# Need to adjust path to get necessary jar imports.
# XXX - it might be cleaner to chain imports by using
# the sevenzipper (s7 alias) below to reference
# double imported modules. For development and
# convenience I reimported everything as though
# sevenzipper.py and unsevenzipper.py were separate
# operations.
import sys
import folderstozip as fld
sys.path.append(fld.PATH7ZJB)
sys.path.append(fld.PATH7ZJBOSSPEC)
import os
import sevenzipper as s7
import SevenZipThingExtract
def subdirectoryornot(pathx):
    """
    Boolean function that returns
    True if string pathx is a
    subdirectory of the MineSight
    project folder and False if
    the files belong directly to
    the MineSight project folder.
    """
    pathx = pathx.replace(fld.SAMEFOLDERWIN, fld.BACKSLASH)
    pathlist = pathx.split(fld.BACKSLASH)
    if len(pathlist) > 1:
        return True
    return False
def getdirectories(dirx):
    """
    Get list of lists of directories
    in path under project folder
    from 7zip archives in project
    folder for archives.
    Returns a two-tuple of list and
    dictionary indicating which
    7z files are same directory
    archives and which are archived
    subdirectory files.
    dirx is a string for the file
    path of the directory to
    be walked (./Fwaulu for example).
    """
    dirs = os.walk(dirx)
    # One level, no subfolders.
    files = dirs.next()[2]
    # Get directories first.
    rawpaths = []
    subdirornot = {}
    for filex in files:
        # Skip uncompressed split file tracker.
        if filex == fld.SPLITFILETRACKER:
            continue
        # I don't know if it's a subdirectory or not, so I'll go with False.
        s7tx = SevenZipThingExtract(dirx + fld.FRONTSLASH + filex, dirx, False)
        folders = list(s7tx.getfolders())
        rawpaths.extend(folders)
        # All the paths in folders have the same prefix -
        # just do one.
        subdirornot[filex] = subdirectoryornot(folders[0])
    # Get just directories
    justdirectories = [pathx.replace(fld.SAMEFOLDERWIN, fld.BACKSLASH).split(fld.BACKSLASH)[1:-1]
                       for pathx in rawpaths if pathx.split(fld.BACKSLASH)[1:-1]]
    justdirectories = set([tuple(x) for x in justdirectories])
    justdirectories = list(justdirectories)
    justdirectories.sort()
    return justdirectories, subdirornot
def makedirectories(dirn):
    """
    Create directory paths within archive
    project folder to accept uncompressed
    files.
    Returns subdirornot dictionary.
    dirn is a string for the file
    path of the directory to
    be walked (./Fwaulu for example).
    """
    justdirectories, subdirornot = getdirectories(dirn)
    maxdepth = max(len(dirx) for dirx in justdirectories)
    for x in xrange(0, maxdepth):
        justdirectoriesii = set([tuple(dirx[0:x + 1]) for dirx in justdirectories
                                 if len(dirx) >= x + 1])
        for diry in justdirectoriesii:
            dirw = dirn + fld.FRONTSLASH + fld.FRONTSLASH.join(diry)
            os.mkdir(dirw)
    return subdirornot
def extractfiles(dirx):
    """
    Extract files from 7z files
    in project archive folder.
    Side effect function.
    dirx is a string for the file
    path of the directory to
    be walked.
    """
    subdirornot = makedirectories(dirx)
    dirs = os.walk(dirx)
    # One level, no subfolders.
    files = dirs.next()[2]
    for filex in files:
        # Skip uncompressed split file tracker.
        if filex == fld.SPLITFILETRACKER:
            continue
        s7tx = SevenZipThingExtract(dirx + fld.FRONTSLASH + filex,
                                    dirx, subdirornot[filex])
        s7tx.extractfiles()
def gluetogethersplitfiles(dirx):
    """
    Make split up files whole.
    Side effect function.
    dirx is the folder in which the split
    files reside.
    """
    # Glue together big files.
    # Do this in a very controlling,
    # structured way:
    # 1) Read the split file tracker csv file.
    # 2) Determine the number and names and paths
    #    of files to be reconstructed and the
    #    number of parts in each.
    # 3) Check that everything is there for
    #    each file to be reconstructed.
    # 4) Get the new relative path.
    # 5) Glue back together programmatically.
    splitfiles = []
    # fld.SPLITFILETRACKER is structured as original path
    # of file split, number of file split.
    with open(fld.SAMEFOLDERWIN + dirx +
              fld.FRONTSLASH + fld.SPLITFILETRACKER, 'r') as f:
        for linex in f:
            strippedline = [x.strip() for x in linex.split(fld.UCOMMA)]
            splitfiles.append(tuple(strippedline))
    orignames = [x[0] for x in splitfiles]
    splitoriginals = set(orignames)
    # Make dictionary that is easy to cycle through.
    filesx = {}
    for orig in splitoriginals:
        basedir, basefolder = s7.getbasedir(orig)
        filesx[orig] = {}
        filesx[orig][fld.BASEPATH] = fld.SAMEFOLDER + basedir + basefolder[1:]
        # A list (not a generator) so that orig is bound now
        # rather than whenever the parts get iterated over.
        filesx[orig][fld.FILES] = [fld.SPLITFILE.format(filesx[orig][fld.BASEPATH], filex[1])
                                   for filex in splitfiles if filex[0] == orig]
    for orig in filesx:
        with open(filesx[orig][fld.BASEPATH], fld.WB) as mainfile:
            for filex in filesx[orig][fld.FILES]:
                with open(filex, fld.RB) as splitfile:
                    mainfile.write(splitfile.read())
def restore(dirx):
    """
    Restores MineSight project directory
    inside program path.
    dirx is a string for the directory
    to be restored (./Fwaulu, for example).
    Side effect function.
    """
    extractfiles(dirx)
    gluetogethersplitfiles(dirx)
    print('Done')
restore is the main function for the module (uncompression).
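gluetogethersplitfiles reads each split part whole before writing it out. If memory is a concern, a sketch along these lines streams each part instead, using shutil.copyfileobj from the standard library. The function name gluestreamed and the buffersize default are hypothetical. In the real module the part names come from fld.SPLITFILE and the tracker file.

```python
import shutil

def gluestreamed(targetpath, partpaths, buffersize=1024 * 1024):
    """
    Concatenate the files in partpaths, in order,
    into targetpath without holding a whole part
    in memory.
    Side effect function.
    """
    with open(targetpath, 'wb') as mainfile:
        for partpath in partpaths:
            with open(partpath, 'rb') as splitfile:
                # Copies buffersize bytes at a time.
                shutil.copyfileobj(splitfile, mainfile, buffersize)
```

The parts have to be passed in split order, same as the tracker file lists them.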
Notes:
1) I don't have admin rights at work and did not have javac (the Java compiler) available. You can download an SDK or SRE Java package from Oracle that has it. Without admin rights you can't install it normally, but you can still use it. My compilation went something like this:
<path to downloaded JDK>/bin/javac -cp <path to downloaded 7-ZipJBinding>/lib/* <myclassname>.java
2) I've left all the split up files and 7z archives in the folder where I decompress my files and recombine the split files. This takes up a lot of space depending on what you're working with. If space is at a premium, you probably want to write jython code to move or delete the archives after uncompressing them.
3) The most time consuming part of runtime is the compression, uncompression, and splitting and recombining of split files. Porting some of this to java (instead of jython) might speed things up. I code faster and generally better in jython. Also, my objective was control, not speed. YMMV (your mileage may vary) with this approach. There are far better general purpose ones.
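For note 2, a minimal sketch of the cleanup might look like the following. It assumes the archives end in .7z and sit directly in the project archive folder. removearchives is a hypothetical name, and it leaves the split file parts alone, since their naming depends on fld.SPLITFILE.

```python
import os

def removearchives(dirx, extension='.7z'):
    """
    Delete files ending in extension that sit
    directly under dirx (no recursion).
    Returns a sorted list of the names removed.
    """
    removed = []
    for filex in os.listdir(dirx):
        pathx = os.path.join(dirx, filex)
        # Only touch regular files, never folders.
        if os.path.isfile(pathx) and filex.lower().endswith(extension):
            os.remove(pathx)
            removed.append(filex)
    removed.sort()
    return removed
```

Running it only after restore has finished without errors would be the safer order.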
Thanks for stopping by.