Monday, April 18, 2016

7-Zip-JBinding API with jython on Windows

I have a set of multi-GB Windows folders that I need to archive in 7-Zip format each month.  I'd prefer not to compress the folders "manually" with the mouse, and I didn't want to drive the command line through the subprocess module as I have with some other programs.  Ideally, I wanted to control 7-Zip programmatically.  The 7-Zip-JBinding libraries offered a means to do this from Jython.

7-Zip-JBinding is built around Java interfaces that have to be implemented in a fairly specific way, so I did not venture too far from the examples given in the 7-Zip-JBinding documentation.  I smithed two classes for my own purposes, one for compressing and one for uncompressing, and present them (Java code) below.  The decompression class has a separate method for retrieving the paths of the compressed files.  This is not efficient (it runs a full extraction pass just to collect the paths), but given what I need to do and the limitations of the library and the approach, it works out for the best.

import java.io.IOException;
import java.io.RandomAccessFile;

import net.sf.sevenzipjbinding.IOutCreateArchive7z;
import net.sf.sevenzipjbinding.IOutCreateCallback;
import net.sf.sevenzipjbinding.IOutItem7z;
import net.sf.sevenzipjbinding.ISequentialInStream;
import net.sf.sevenzipjbinding.SevenZip;
import net.sf.sevenzipjbinding.SevenZipException;
import net.sf.sevenzipjbinding.impl.OutItemFactory;
import net.sf.sevenzipjbinding.impl.RandomAccessFileOutStream;
import net.sf.sevenzipjbinding.util.ByteArrayStream;


/* Off StackOverflow - works for getting
 * file content/bytes from path */
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.Path;


public class SevenZipThing {

    private static final String RETCHAR = "\n";
    private static final String INTFMT = "%,d";
    private static final String BYTESTOCOMPRESS = " bytes total to compress\n";
    private static final String ERROCCURS = "Error occurs: ";
    private static final String COMPRESSFILE = "\nCompressing file ";
    private static final String RW = "rw";
    private static final int LVL = 5;
    private static final String SEVZERR = "7z-Error occurs:";
    private static final String ERRCLOSING = "Error closing archive: ";
    private static final String ERRCLOSINGFLE = "Error closing file: ";
    private static final String SUCCESS = "\nCompression operation succeeded\n";


    private String filename;
    /* String[] array conversion from a Jython list is
     * implicit and poses no problems (JDK7) */
    private String[] pathsx;


    public SevenZipThing(String filename, String[] pathsx) {
        this.filename = filename;
        this.pathsx = pathsx;
    }


    /**
     * The callback provides information about archive items.
     *
     * I copied this straight from the 7-Zip-JBinding author's
     * code, but I haven't put much in to deal with messaging
     * or error handling.
     */
    private final class MyCreateCallback
            implements IOutCreateCallback<IOutItem7z> {


        public void setOperationResult(boolean operationResultOk)
                throws SevenZipException {
            // Track each operation result here
        }


        public void setTotal(long total) throws SevenZipException {
            // Track operation progress here
    
            System.out.print(RETCHAR + String.format(INTFMT, total) +
                     BYTESTOCOMPRESS);
        }


        public void setCompleted(long complete) throws SevenZipException {
            // Track operation progress here
        }


        public IOutItem7z getItemInformation(int index,
                OutItemFactory<IOutItem7z> outItemFactory) {

            IOutItem7z item = outItemFactory.createOutItem();
            Path path = Paths.get(pathsx[index]);
            item.setPropertyPath(pathsx[index]);
            try {
                // Ask the file system for the size instead of reading
                // the whole file into a byte array just to measure it.
                item.setDataSize(Files.size(path));
                return item;
            // XXX - I could do a lot better than this (error handling).
            } catch (Exception e) {
                System.err.println(ERROCCURS + e);
            }
            return null;
        }


        public ISequentialInStream getStream(int i)
            throws SevenZipException {

            Path path = Paths.get(pathsx[i]);
            try {
                byte[] data = Files.readAllBytes(path);
                System.out.println(COMPRESSFILE + path);
                return new ByteArrayStream(data, true);
            } catch (Exception e) {
                System.err.println(ERROCCURS + e);
            }
            return null;
        }
    }


    public void compress() {
       
        /* Mostly copied from sevenZipJBinding's author's code -
         * I made the compress method public to work from jython.
         * Also, I deal with all of the file listing in jython
         * and just pass a list to this class. */

        boolean success = false;
        RandomAccessFile raf = null;
        IOutCreateArchive7z outArchive = null;
        try {
            raf = new RandomAccessFile(filename, RW);

            // Open out-archive object
            outArchive = SevenZip.openOutArchive7z();

            // Configure archive
            outArchive.setLevel(LVL);
            outArchive.setSolid(true);

            // All available processors.
            outArchive.setThreadCount(0);

            // Create archive
            outArchive.createArchive(new RandomAccessFileOutStream(raf),
                    pathsx.length, new MyCreateCallback());
            success = true;
        } catch (SevenZipException e) {
            System.err.println(SEVZERR);
            // Get more information using extended method
            e.printStackTraceExtended();
        } catch (Exception e) {
            System.err.println(ERROCCURS + e);
        } finally {
            if (outArchive != null) {
                try {
                    outArchive.close();
                } catch (IOException e) {
                    System.err.println(ERRCLOSING + e);
                    success = false;
                }
            }
            if (raf != null) {
                try {
                    raf.close();
                } catch (IOException e) {
                    System.err.println(ERRCLOSINGFLE + e);
                    success = false;
                }
            }
        }
        if (success) {
            System.out.println(SUCCESS);
        }
    }
}


import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.File;
import java.io.OutputStream;
import java.io.FileOutputStream;
import java.io.FileNotFoundException;

import java.util.Arrays;
import java.util.ArrayList;

import net.sf.sevenzipjbinding.IInArchive;
import net.sf.sevenzipjbinding.PropID;
import net.sf.sevenzipjbinding.SevenZip;
import net.sf.sevenzipjbinding.SevenZipException;
import net.sf.sevenzipjbinding.impl.RandomAccessFileInStream;
import net.sf.sevenzipjbinding.IArchiveExtractCallback;
import net.sf.sevenzipjbinding.ExtractOperationResult;
import net.sf.sevenzipjbinding.ExtractAskMode;
import net.sf.sevenzipjbinding.ISequentialOutStream;

/* 7z archive format */
/* SEVEN_ZIP is the one I want */
import net.sf.sevenzipjbinding.ArchiveFormat;


public class SevenZipThingExtract {
    private String filename;
    private String extractdirectory;
    private ArrayList<String> foldersx = null;
    private boolean subdirectory = false;

    private static final String ERROPENINGFLE = "Error opening file: ";
    private static final String ERRWRITINGFLE = "Error writing to file: ";
    private static final String EXTERR = "Extraction error";
    private static final String INFOFMT = "%9X | %10s | %s";
    private static final String RETCHAR = "\n";
    private static final String INTFMT = "%,d";
    private static final String BYTESTOEXTRACT = " bytes total to extract\n";
    private static final String RW = "rw";
    private static final String BACKSLASH = "\\";
    private static final String SEVZERR = "7z-Error occurs:";
    private static final String ERROCCURS = "Error occurs: ";
    private static final String ERRCLOSING = "Error closing archive: ";
    private static final String ERRCLOSINGFLE = "Error closing file: ";


    public SevenZipThingExtract(String filename, String extractdirectory,
                                boolean subdirectory) {
        this.filename = filename;
        this.foldersx = new ArrayList<String>();
        this.extractdirectory = extractdirectory;
        this.subdirectory = subdirectory;
    }


    private final class MyExtractCallback
            implements IArchiveExtractCallback {

        // Copied mostly from example.
        private int hash = 0;
        private int size = 0;
        private int index;
        private boolean skipExtraction;
        private IInArchive inArchive;

        private OutputStream outputStream;
        private File file;


        public MyExtractCallback(IInArchive inArchive) {
            this.inArchive = inArchive;
        }


        @Override
        public ISequentialOutStream getStream(int index,
                          ExtractAskMode extractAskMode)
                throws SevenZipException {


            this.index = index;
            // I'm not skipping anything.
            skipExtraction = false;

            String path = (String) inArchive.getProperty(index, PropID.PATH);
            // Try prepending extractdirectory.
            if (subdirectory) {
                path = extractdirectory + BACKSLASH + path.substring(2);
            } else {
                path = extractdirectory + BACKSLASH + path;
            }
            file = new File(path);

            try {
                outputStream = new FileOutputStream(file);
            } catch (FileNotFoundException e) {
                throw new SevenZipException(ERROPENINGFLE
                        + file.getAbsolutePath(), e);
            }
            return new ISequentialOutStream() {
                public int write(byte[] data) throws SevenZipException {
                    try {
                        outputStream.write(data);
                    } catch (IOException e) {
                        // Keep the original exception as the cause.
                        throw new SevenZipException(ERRWRITINGFLE
                                + file.getAbsolutePath(), e);
                    }
                    return data.length; // Return amount of consumed data
                }
            };
        }


        public void prepareOperation(ExtractAskMode extractAskMode)
                throws SevenZipException {
        }

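        public void setOperationResult(ExtractOperationResult extractOperationResult)
                throws SevenZipException {
            // Close the stream opened in getStream so that file
            // handles are not leaked over a large archive.
            if (outputStream != null) {
                try {
                    outputStream.close();
                } catch (IOException e) {
                    throw new SevenZipException(ERRCLOSINGFLE
                            + file.getAbsolutePath(), e);
                }
                outputStream = null;
            }
            // Track each operation result here
            if (extractOperationResult != ExtractOperationResult.OK) {
                System.err.println(EXTERR);
            } else {
                System.out.println(String.format(INFOFMT, hash, size,
                        inArchive.getProperty(index, PropID.PATH)));
                hash = 0;
                size = 0;
            }
        }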


        public void setTotal(long total) throws SevenZipException {
            System.out.print(RETCHAR + String.format(INTFMT, total) +
                             BYTESTOEXTRACT);
        }


        public void setCompleted(long complete) throws SevenZipException {
            // Track operation progress here
        }
    }


    private final class MyGetPathsCallback
            implements IArchiveExtractCallback {

        // Copied mostly from example.
        private int hash = 0;
        private int size = 0;
        private int index;
        private boolean skipExtraction;
        private IInArchive inArchive;

        public MyGetPathsCallback(IInArchive inArchive) {
            this.inArchive = inArchive;
        }

        public ISequentialOutStream getStream(int index,
            ExtractAskMode extractAskMode)
                throws SevenZipException {
            this.index = index;
            // I'm not skipping anything.
            skipExtraction = false;

            String path = (String) inArchive.getProperty(index,
                    PropID.PATH);
            foldersx.add(path);

             return new ISequentialOutStream() {
                public int write(byte[] data) throws SevenZipException {
                    hash ^= Arrays.hashCode(data);
                    size += data.length;
                    // Return amount of processed data
                    return data.length;
                }
            };
        }


        public void prepareOperation(ExtractAskMode extractAskMode)
                throws SevenZipException {
        }


        public void setOperationResult(ExtractOperationResult extractOperationResult)
                throws SevenZipException {
            // Track each operation result here
            if (extractOperationResult != ExtractOperationResult.OK) {
                System.err.println(EXTERR);
            } else {
                System.out.println(String.format(INFOFMT, hash, size,
                        inArchive.getProperty(index, PropID.PATH)));
                hash = 0;
                size = 0;
            }
        }


        public void setTotal(long total) throws SevenZipException {
            System.out.print(RETCHAR + String.format(INTFMT, total) +
                BYTESTOEXTRACT);
        }


        public void setCompleted(long complete) throws SevenZipException {
            // Track operation progress here
        }
    }


    public void extractfiles() {
       
        boolean success = false;
        RandomAccessFile raf = null;
        IInArchive inArchive = null;
        try {
            raf = new RandomAccessFile(filename, RW);

            inArchive = SevenZip.openInArchive(ArchiveFormat.SEVEN_ZIP,
                    new RandomAccessFileInStream(raf));

            int itemCount = inArchive.getNumberOfItems();
           
            // From StackOverflow - could use IntStream,
            // but that's Java 1.8 (using 1.7).
            int[] fileindices = new int[itemCount];
            for(int k = 0; k < fileindices.length; k++)
                fileindices[k] = k;
            inArchive.extract(fileindices, false,
                new MyExtractCallback(inArchive));
        } catch (SevenZipException e) {
            System.err.println(SEVZERR);
            // Get more information using extended method
            e.printStackTraceExtended();
        } catch (Exception e) {
            System.err.println(ERROCCURS + e);
        } finally {
            if (inArchive != null) {
                try {
                    inArchive.close();
                } catch (IOException e) {
                    System.err.println(ERRCLOSING + e);
                }
            }
            if (raf != null) {
                try {
                    raf.close();
                } catch (IOException e) {
                    System.err.println(ERRCLOSINGFLE + e);
                }
            }
        }
    }


    public ArrayList<String> getfolders() {
       
        boolean success = false;
        RandomAccessFile raf = null;
        IInArchive inArchive = null;

        try {
            raf = new RandomAccessFile(filename, RW);

            inArchive = SevenZip.openInArchive(ArchiveFormat.SEVEN_ZIP,
                    new RandomAccessFileInStream(raf));

            int itemCount = inArchive.getNumberOfItems();
           
            // From StackOverflow - could use IntStream,
            // but that's Java 1.8 (using 1.7).
            int[] fileindices = new int[itemCount];
            for(int k = 0; k < fileindices.length; k++)
                fileindices[k] = k;
            inArchive.extract(fileindices, false,
                new MyGetPathsCallback(inArchive));
        } catch (SevenZipException e) {
            System.err.println(SEVZERR);
            // Get more information using extended method
            e.printStackTraceExtended();
        } catch (Exception e) {
            System.err.println(ERROCCURS + e);
        } finally {
            if (inArchive != null) {
                try {
                    inArchive.close();
                } catch (IOException e) {
                    System.err.println(ERRCLOSING + e);
                }
            }
            if (raf != null) {
                try {
                    raf.close();
                } catch (IOException e) {
                    System.err.println(ERRCLOSINGFLE + e);
                }
            }
        }
        return foldersx;
    }
}


The getfolders method in the SevenZipThingExtract class is the extra method that returns the list of paths stored in the archive.  As noted in the Jython code below, the limits on the number of bytes and files that can be compressed at once necessitate splitting larger files into chunks.  Also, for my specific use case, I need to extract files to a specific folder and set of subfolders.  My methodology is outlined in the comments in the Jython code.  The good news:  if I get run over by a bus and the uncompression part of the program gets lost, people will be able to get the files back with some effort.  The bad news:  they will be cursing my headstone.  You do the best you can.
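
To make the "get the files back with some effort" part concrete, here is a minimal sketch in plain Python (not the Jython program itself) of splitting a file into numbered parts and gluing them back together.  The function names split_file, join_files and demo are hypothetical, and the chunk size is deliberately tiny for illustration; the real program uses a much larger chunk and its own tracking file.

```python
import os
import tempfile

# Tiny chunk size for illustration only (an assumption for this sketch).
CHUNK = 1024


def split_file(path, chunk=CHUNK):
    """Write path.001, path.002, ... and return the list of part paths."""
    parts = []
    number = 1
    with open(path, 'rb') as src:
        while True:
            data = src.read(chunk)
            if not data:
                break
            part = '{0:s}.{1:03d}'.format(path, number)
            with open(part, 'wb') as dst:
                dst.write(data)
            parts.append(part)
            number += 1
    return parts


def join_files(parts, target):
    """Concatenate the parts, in name order, into target."""
    with open(target, 'wb') as dst:
        # Zero-padded suffixes sort correctly as plain strings.
        for part in sorted(parts):
            with open(part, 'rb') as src:
                dst.write(src.read())


def demo():
    """Round-trip a 2,500 byte file; return (part count, contents match)."""
    workdir = tempfile.mkdtemp()
    original = os.path.join(workdir, 'model.dat')
    with open(original, 'wb') as f:
        f.write(b'x' * 2500)
    parts = split_file(original)
    rebuilt = os.path.join(workdir, 'rebuilt.dat')
    join_files(parts, rebuilt)
    with open(original, 'rb') as a:
        with open(rebuilt, 'rb') as b:
            same = a.read() == b.read()
    return len(parts), same
```

Anyone recovering files by hand would do the equivalent of join_files on the extracted parts, in numeric order.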

The three Jython modules follow.  The first one, folderstozip.py, is just constants:


#!java -jar C:\jython-2.7.0\jython.jar

# folderstozip.py

"""
Constants used in compression and
decompression.
"""


FRONTSLASH = '/'
BACKSLASH = '\\'
EMPTY = ''
SAMEFOLDER = './'
SAMEFOLDERWIN = u'.\\'

SPLITFILETRACKER = 'SPLITFILETRACKER.csv'
SPLITFILE = '{0:s}.{1:s}'
UCOMMA = u','

# 3rd party sevenZipJBindings library.
PATH7ZJB = 'C:/MSPROJECTS/EOMReconciliation/2016/03March'
PATH7ZJB += '/Backup/sevenzipjbinding/lib/sevenzipjbinding.jar'


# OS specific 3rd party sevenZipJBindings library.
PATH7ZJBOSSPEC = r'C:/MSPROJECTS/EOMReconciliation/2016/03March'
PATH7ZJBOSSPEC += '/Backup/sevenzipjbinding/lib/sevenzipjbinding-Windows-amd64.jar'


PROGFOLDER = 'C:/MSPROJECTS/EOMReconciliation/2016/03March/Backup'
PROGFOLDER += FRONTSLASH


# Informational messages.
WROTEFILE = 'Wrote file {:s}\n'
SPLITFILEMSG = 'Have now split {0:,d} bytes of file {1:s} into {2:d} {3:,d} chunks.\n'
DONESPLITTING = '\nDone splitting file'
FILESAFTERSPLIT = '\n{:d} files after split'

COMPRESSING = '\nCompressing file {:s} . . .\n'
DELETING = '\nDeleting file {:s} . . .\n'
DELETINGDIR = '\nNow deleting {:s} . . .\n'


# Five digits: room for 100,000 unique names (00000 to 99999).
UNIQUEX = '{0:05d}'


# XXX - multiple file archives limited to
#       10KB - reason unknown - crashes jvm
#       with IInStream interface class not
#       found.
# XXX - choked on 8700 bytes - try dropping
#       this from 9500 to 8500.
MULTFILELIMIT = 8500
# Explicit integer division.
HALFLIMIT = MULTFILELIMIT // 2

# About 50 splits for a 3GB file.
CHUNK = 2 ** 26


# Path plus split number.
FILEN = r'{0:s}.{1:03d}'

# Path plus basefilename.
FILEB = r'{0:s}{1:s}'


# Read/Write constants.
RB = 'rb'
WB = 'wb'
W = 'w'


# Project directory plus unique archive name.
ARCHIVEX = '{0:s}/{1:s}.7z'


# multifile archive
MULTARCHIVEX = '{0:s}/archive{1:03d}.7z'
MULTFILES = '. . . multiple files'


# File categories.
# Size less than HALFLIMIT.
SMALL = 'small'
# Size greater than or equal to HALFLIMIT but
# less than or equal to CHUNK.
MEDIUM = 'medium'
# Larger than CHUNK.
LARGE = 'large'


BASEPATH = 'basepath'

FILES = 'files'


# XXX - this folder has recognizable
#       folder names within your domain
#       space - mine are open pit mining
#       area names.
BASEDIRS = ['Pit-1', 'Pit-2', 'Pit-3']


#!java -jar C:/jython-2.7.0/jython.jar

# sevenzipper.py

"""
Use java 3rd party 7-zip compression
library (sevenZipJBindings) from
jython to 7zip up MineSight project
files.
"""


import folderstozip as fld
# Need to adjust path to get necessary jar imports.
import sys

# Need for os.path
import os


# Original path of file plus split number.
SPLITFILERECORD = '{0:s},{1:03d}'


sys.path.append(fld.PATH7ZJB)
sys.path.append(fld.PATH7ZJBOSSPEC)


# java 7zip library
import SevenZipThing as z7thing


# For copying files to program
# directory and deleting the old
# ones where necessary.
import shutil

# For unique archive names.
import itertools


COUNTERX = itertools.count(0, 1)


def splitfile(originalfilepath, splitfilestrackerfile):
    """
    Split file at (string) originalfilepath
    into fld.CHUNK sized chunks and indicate
    sequence by number in new split file
    name.

    Return generator of relative file paths
    inside project folder.

    originalfilepath is the path of the
    file that needs to be split into parts.

    splitfilestrackerfile is an open file
    object used for tracking file splits
    for later retrieval.
    """
    sizeoffile = os.path.getsize(originalfilepath)
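    # Ceiling division, so a file that is an exact multiple of
    # fld.CHUNK does not produce an empty trailing chunk.
    chunks = (sizeoffile + fld.CHUNK - 1) // fld.CHUNK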
    # Counter.
    i = 1
    with open(originalfilepath, fld.RB) as f:
        while i < chunks + 1:
            with open(fld.FILEN.format(originalfilepath, i), fld.WB) as f2:
                f2.write(f.read(fld.CHUNK))
                print(fld.WROTEFILE.format(fld.FILEN.format(originalfilepath, i)))
                print(fld.SPLITFILEMSG.format(f.tell(), originalfilepath, i, fld.CHUNK))
                print >> splitfilestrackerfile, (SPLITFILERECORD.format(originalfilepath, i))
                i += 1
    print(fld.DONESPLITTING)
    print(fld.FILESAFTERSPLIT.format(i - 1))
    return (fld.FILEN.format(originalfilepath, x) for x in xrange(1, i))


def movefiles(movefilesx, intermediatepath):
    """
    Move files from MineSight project directory
    to program directory.

    Return a list of base file names for the
    moved files.

    movefilesx is a generator of file paths.
    intermediatepath is a string relative path
    between the program folder and the sub-folder
    of the MineSight directory (_msresources/06SOLIDS,
    for example).
    """
    # Move files to that folder.
    movedfiles = []
    for pathx in movefilesx:
        shutil.move(pathx, fld.PROGFOLDER + intermediatepath +
                    os.path.basename(pathx))
        movedfiles.append(intermediatepath + os.path.basename(pathx))
    return movedfiles


def copyfiles(copyfilesx, intermediatepath):
    """
    Copy files from MineSight project directory
    to program directory.

    Return a list of base file names for the
    copied files.

    copyfilesx is a generator of file paths.
    intermediatepath is a string relative path
    between the program folder and the sub-folder
    of the MineSight directory (_msresources/06SOLIDS,
    for example).
    """
    # Copy files to that folder.
    copiedfiles = []
    for pathx in copyfilesx:
        shutil.copyfile(pathx, fld.PROGFOLDER + intermediatepath +
                        os.path.basename(pathx))
        copiedfiles.append(intermediatepath + os.path.basename(pathx))
    return copiedfiles


def compressfilessingle(filestocompress, prefix, basedir):
    """
    Compresses files into an archive.

    This is for larger files that take up
    an entire archive (7z file).

    filestocompress is a list of paths of
    files to be compressed.  These files
    reside inside the program directory.

    prefix is a string path addition, usually
    './' that allows the function to deal
    with relative paths for files that reside
    in subfolders.

    basedir is the name of the main MineSight
    project directory (Fwaulu, for example).

    Side effect function.
    """
    for pathx in filestocompress:
        basename = os.path.split(pathx)[1]
        # Need unique name for subfolder files with same names.
        uniqueid = fld.UNIQUEX.format(COUNTERX.next())
        uniquename = uniqueid + basename
        print(fld.COMPRESSING.format(prefix + basename))
        archx = z7thing(fld.ARCHIVEX.format(basedir, uniquename),
                        [prefix + basename])
        archx.compress()


def compressfilesmultiple(filestocompress, indexx, basedir):
    """
    Compresses files into an archive.

    filestocompress is a list of paths of
    files to be compressed.  These files
    reside inside the program directory.

    indexx is an integer that gives the
    archive a unique name.

    basedir is the name of the main MineSight
    project directory (Fwaulu, for example).

    Side effect function.
    """
    print(fld.COMPRESSING.format(fld.MULTFILES))
    archx = z7thing(fld.MULTARCHIVEX.format(basedir, indexx),
                                            filestocompress)
    archx.compress()


def segregatefiles(directoryx, basefiles):
    """
    From a string directory path directoryx
    and a list of base file names, returns
    a dictionary of lists of files and their
    sizes sorted on size and keyed on file
    category.
    """
    retval = {}
    # Add separator to end of directory path.
    directoryx += fld.FRONTSLASH
    # Get all files in folder and their sizes.
    allfiles = [(os.path.getsize(fld.FILEB.format(directoryx, filex)), filex)
                 for filex in basefiles]
    retval[fld.SMALL] = [x for x in allfiles if x[0] < fld.HALFLIMIT]
    retval[fld.SMALL].sort()
    retval[fld.MEDIUM] = [x for x in allfiles if x[0] >= fld.HALFLIMIT and
                          x[0] <= fld.CHUNK]
    retval[fld.MEDIUM].sort()
    retval[fld.LARGE] = [x for x in allfiles if x[0] > fld.CHUNK]
    retval[fld.LARGE].sort()
    return retval


def deletefiles(movedfiles):
    """
    Delete files that have been compressed.

    movedfiles is a list of paths of
    files that have been moved or copied to
    the program directory for compression.

    Side effect function.
    """
    for pathx in movedfiles:
        print(fld.DELETING.format(pathx))
        os.remove(pathx)


def getsmallfilegroupings(smallfiles):
    """
    Generator function that yields
    a list of files whose sum is
    less than the program's limit
    for bytes to be archived in a
    multiple file archive.

    smallfiles is a list of two tuples
    of (filesize in bytes, file path).
    """
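    # Assumes every file is smaller than fld.MULTFILELIMIT
    # (true for the small category).
    lenx = len(smallfiles)
    insidecounter1 = 0
    insidecounter2 = 1
    while (insidecounter2 < (lenx + 1)):
        sumx = sum(x[0] for x in smallfiles[insidecounter1:insidecounter2])
        if sumx > fld.MULTFILELIMIT:
            # Back up one so the group stays within the limit.
            insidecounter2 -= 1
            yield (x[1] for x in smallfiles[insidecounter1:insidecounter2])
            # Start the next group at the file that overflowed
            # (starting one later would skip that file).
            insidecounter1 = insidecounter2
            insidecounter2 = insidecounter1 + 1
        else:
            insidecounter2 += 1
    # Yield whatever is left over as the final group.
    if insidecounter1 < lenx:
        yield (x[1] for x in smallfiles[insidecounter1:lenx])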


def compresslargefiles(largefiles, dirx, prefix, basedir, splitfilestrackerfile):
    """
    Deal with compression of files that need to
    be split prior to compression.

    largefiles is a list of two tuples of file
    sizes and names.

    dirx is the directory (str) in which the files
    are located.

    prefix is a string prefix to augment path
    identification for compression.

    basedir is the name of the main MineSight
    project directory (Fwaulu, for example).

    splitfilestrackerfile is an open file
    object used for tracking file splits
    for later retrieval.

    Side effect function.
    """
    for filex in largefiles:
        # Get generator of paths of splits.
        splitfiles = splitfile(fld.FILEB.format(dirx, filex[1]),
                               splitfilestrackerfile)
        movedfiles = movefiles(splitfiles, prefix)
        compressfilessingle(movedfiles, prefix, basedir)
        deletefiles(movedfiles)

def compressmediumfiles(mediumfiles, dirx, prefix, basedir):
    """
    Deal with compression of files that need to
    be compressed each to its own archive.

    mediumfiles is a list of two tuples of file
    sizes and paths.

    dirx is the directory (str) in which the files
    are located.

    prefix is a string prefix to augment path
    identification for compression.

    basedir is the name of the main MineSight
    project directory (Fwaulu, for example).

    Side effect function.
    """
    filestocopy = (dirx + x[1] for x in mediumfiles)
    copiedfiles = copyfiles(filestocopy, prefix)
    compressfilessingle(copiedfiles, prefix, basedir)
    deletefiles(copiedfiles)

def compresssmallfiles(smallfiles, dirx, prefix, indexx, basedir):
    """
    Deal with compression of files that can be
    compressed in groups.

    smallfiles is a list of two tuples of file
    sizes and paths.

    dirx is the directory (str) in which the files
    are located.

    prefix is a string prefix to augment path
    identification for compression.

    indexx is the current index that the 7zip
    file counter (ensures unique archive name)
    is on.

    basedir is the name of the main MineSight
    project directory (Fwaulu, for example).

    Returns integer for current archive counter
    index.
    """
    smallgroupings = getsmallfilegroupings(smallfiles)
    while True:
        try:
            grouplittlefiles = smallgroupings.next()
            littlefiles = (dirx + x for x in grouplittlefiles)
            copiedfiles = copyfiles(littlefiles, prefix)
            compressfilesmultiple(copiedfiles, indexx, basedir)
            indexx += 1
            deletefiles(copiedfiles)
        except StopIteration:
            break
    return indexx


# XXX - hack
def matchbasedir(folderlist):
    """
    Get MineSight project folder name
    that matches a folder in the path
    in question.

    folderlist is a list (in order)
    of directories in a path.

    Returns string.
    """
    for folderx in folderlist:
        for projx in fld.BASEDIRS:
            if projx == folderx:
                return folderx
    return None


def getbasedir(pathx):
    """
    Returns two tuple of strings for
    basedir and basefolder (project
    directory name and base path under
    project directory copied to program
    directory).

    pathx is the directory path being
    processed (str).
    """
    # basedir is project name (Fwaulu, for example).
    foldernames = pathx.split(fld.FRONTSLASH)
    basedir = matchbasedir(foldernames)
    # Get directory under project directory.
    # _msresources, for example.
    idx = foldernames.index(basedir)
    # Directory under program directory ./ for MineSight files.
    basefolder = fld.SAMEFOLDER + fld.FRONTSLASH.join(foldernames[idx + 1:])
    return basedir, basefolder


def dealwithtoplevel(firstdir):
    """
    Compress top level files in the
    MineSight project directory.
   
    firstdir is the three tuple returned
    from the os.walk() generator function.

    Returns two tuple of integer smallfile
    multifilecounter used for naming
    multiple file archives and splitfilestrackerfile,
    an open file object for tracking split
    files for later reconstruction.
    """
    # Top level files.
    dirx = firstdir[0] + fld.FRONTSLASH
    basedir, basefolder = getbasedir(dirx)
    # File to track split files for later glueing back together.
    splitfilestrackerfile = open(fld.SAMEFOLDER + basedir + fld.FRONTSLASH +
                                 fld.SPLITFILETRACKER, fld.W)
    firstdirfiles = segregatefiles(firstdir[0], firstdir[2])
    compresslargefiles(firstdirfiles[fld.LARGE], dirx, fld.EMPTY, basedir,
                       splitfilestrackerfile)
    compressmediumfiles(firstdirfiles[fld.MEDIUM], dirx, fld.EMPTY, basedir)
    # This is for keeping track of
    # archives with more than one file.
    multifilecounter = 1
    multifilecounter = compresssmallfiles(firstdirfiles[fld.SMALL], dirx,
                                          fld.EMPTY, multifilecounter, basedir)
    return multifilecounter, splitfilestrackerfile


def dealwithlowerleveldirectories(dirs, multifilecounter, splitfilestrackerfile):
    """
    Finishes out compression of lower level
    folders under top level MineSight project
    directory.

    dirs is a partially exhausted (one iteration)
    os.walk() generator.

    multifilecounter is an integer used for
    naming multiple file archives.

    splitfilestrackerfile is an open file
    object used for tracking file splits
    for later retrieval.

    Returns orphanedfolders, a list of lower level
    folders to be deleted at the end of the program
    run.
    """
    orphanedfolders = []
    for dirx in dirs:
        # XXX - hack - I hate dealing with Windows paths.
        dirn = dirx[0].replace(fld.BACKSLASH, fld.FRONTSLASH)
        diry = dirn + fld.FRONTSLASH
        basedir, basefolder = getbasedir(diry)
        # Create directory in program path.
        fauxdir = fld.PROGFOLDER[:-1] + basefolder[1:-1]
        os.mkdir(fauxdir)
        orphanedfolders.append(fauxdir)
        # Skip anything that doesn't have files.
        if not dirx[2]:
            continue
        # Easiest way to do this might be
        # to track directories and sort
        # files according to size, then
        # filter them accordingly.
        dirfiles = segregatefiles(dirx[0], dirx[2])
        compresslargefiles(dirfiles[fld.LARGE], diry, basefolder,
                           basedir, splitfilestrackerfile)
        compressmediumfiles(dirfiles[fld.MEDIUM], diry, basefolder, basedir)
        multifilecounter = compresssmallfiles(dirfiles[fld.SMALL], diry, basefolder,
                                              multifilecounter, basedir)
    splitfilestrackerfile.close()
    return orphanedfolders


def walkdir(dirx):
    """
    Traverse MineSight project directory,
    7zipping everything along the way.

    dirx is a string for the directory
    to traverse.

    Side effect function.
    """
    dirs = os.walk(dirx)
    # OK - os.walk returns generator that
    #      yields a tuple in the format
    #          (str path,
    #           [list of paths for directories under path],
    #           [list of filenames under path])

    # Top level (Fwaulu, for instance).
    # These files will not have a path
    # prefix of any sort in their respective
    # archives.
    firstdir = dirs.next()
    multifilecounter, splitfilestrackerfile = dealwithtoplevel(firstdir)
    # All other files and folders.
    orphanedfolders = dealwithlowerleveldirectories(dirs, multifilecounter,
                                                    splitfilestrackerfile)
    # Delete lower level folders first - this is necessary.
    orphanedfolders.reverse()
    for orphanx in orphanedfolders:
        print(fld.DELETINGDIR.format(orphanx))
        os.rmdir(orphanx)


def cyclefolders(folderx):
    """
    Wrapper function for compression
    of folder folderx (string).

    Side effect function.
    """
    # 1) Set up empty project directory (ex: Fwaulu)
    #    in program directory.
    # 2) For first set of files, use no prefix for
    #    7zip archive storage (filename only).
    # 3) Check for size of file.
    # 4) If file is bigger than fld.CHUNK, split.
    # 5) If file is smaller than fld.CHUNK, but bigger than
    #    MULTFILELIMIT, compress to one archive.
    # 6) If file is smaller than fld.CHUNK, and smaller than
    #    MULTFILELIMIT, check subsequent files to determine
    #    files to include in archive. Keep track of file
    #    index that puts number of bytes over limit.
    # 7) Compress multiple files to one archive - index
    #    archive to ensure unique name.
    # 8) For all following sets of files, same process,
    #    but must prefix paths with SAMEFOLDER and any
    #    additional folder names.
    foldertracker = []
    # Make directory folder in program directory
    # to hold 7zip files.
    zipfolder = getbasedir(folderx)[0]
    os.mkdir(zipfolder)
    foldertracker.append(zipfolder)
    walkdir(folderx)
    print('\nDone')


cyclefolders is the overarching wrapper function for the module (compression operation).
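
The size rules listed in the cyclefolders comments (steps 3 through 7) can be sketched on their own. This is a minimal illustration, not the module's actual segregatefiles or compresssmallfiles code: the threshold values, the function names classify and batchsmall, and the exact batching rule are all assumptions made for the example.

```python
# Assumed thresholds for illustration only; the real values live in folderstozip.
CHUNK = 100 * 1024 * 1024          # files bigger than this get split
MULTFILELIMIT = 10 * 1024 * 1024   # files at or under this get batched together


def classify(sizes):
    """
    Sort (name, size) pairs into large, medium and small lists.
    """
    large, medium, small = [], [], []
    for name, size in sizes:
        if size > CHUNK:
            large.append(name)
        elif size > MULTFILELIMIT:
            medium.append(name)
        else:
            # Keep the size; it is needed for batching.
            small.append((name, size))
    return large, medium, small


def batchsmall(small, limit=MULTFILELIMIT):
    """
    Group small files into batches; start a new batch
    when the next file would push the total over limit.
    """
    batches, current, total = [], [], 0
    for name, size in small:
        if current and total + size > limit:
            batches.append(current)
            current, total = [], 0
        current.append(name)
        total += size
    if current:
        batches.append(current)
    return batches
```

Each batch would then go into one indexed multiple-file archive, each medium file into its own archive, and each large file would be split before compression.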

#!java -jar C:\jython2.7.0\jython.jar

# unsevenzipper.py

"""
Use java 3rd party 7-zip compression
library (7-Zip-JBinding) from
jython to un-7zip archives.
"""


# Need to adjust path to get necessary jar imports.
# XXX - it might be cleaner to chain imports by using
#       the sevenzipper (s7 alias) below to reference
#       double imported modules.  For development and
#       convenience I reimported everything as though
#       sevenzipper.py and unsevenzipper.py were separate
#       operations.
import sys
import folderstozip as fld
sys.path.append(fld.PATH7ZJB)
sys.path.append(fld.PATH7ZJBOSSPEC)

import os
import sevenzipper as s7
import SevenZipThingExtract

def subdirectoryornot(pathx):
    """
    Boolean function that returns
    True if string pathx is a
    subdirectory of the MineSight
    project folder and False if
    the files belong directly to
    the MineSight project folder.
    """
    pathx = pathx.replace(fld.SAMEFOLDERWIN, fld.BACKSLASH)
    pathlist = pathx.split(fld.BACKSLASH)
    return len(pathlist) > 1


def getdirectories(dirx):
    """
    Get list of lists of directories
    in path under project folder
    from 7zip archives in project
    folder for archives.

    Returns two tuple of list and
    dictionary indicating which
    7z files are same directory
    archives and which are archived
    subdirectory files.

    dirx is a string for the file
    path of the directory to
    be walked (./Fwaulu for example).
    """
    dirs = os.walk(dirx)
    # One level, no subfolders.
    files = dirs.next()[2]
    # Get directories first.
    rawpaths = []
    subdirornot = {}
    for filex in files:
        # Skip uncompressed split file tracker.
        if filex == fld.SPLITFILETRACKER:
            continue
        # I don't know if it's a subdirectory or not, so I'll go with False.
        s7tx = SevenZipThingExtract(dirx + fld.FRONTSLASH + filex, dirx, False)
        folders = list(s7tx.getfolders())
        rawpaths.extend(folders)
        # All the paths in folders have the same prefix -
        # just do one.
        subdirornot[filex] = subdirectoryornot(folders[0])
    # Get just directories
    justdirectories = [pathx.replace(fld.SAMEFOLDERWIN, fld.BACKSLASH).split(fld.BACKSLASH)[1:-1]
                       for pathx in rawpaths if pathx.split(fld.BACKSLASH)[1:-1]]
    justdirectories = set([tuple(x) for x in justdirectories])
    justdirectories = list(justdirectories)
    justdirectories.sort()
    return justdirectories, subdirornot


def makedirectories(dirn):
    """
    Create directory paths within archive
    project folder to accept uncompressed
    files.

    Returns subdirornot dictionary.
    dirn is a string for the file
    path of the directory to
    be walked (./Fwaulu for example).
    """
    justdirectories, subdirornot = getdirectories(dirn)
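    # With no subdirectories in any archive there is
    # nothing to create (and max() of nothing fails).
    if not justdirectories:
        return subdirornot
    maxdepth = max(len(dirx) for dirx in justdirectories)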
    for x in xrange(0, maxdepth):
        justdirectoriesii = set([tuple(dirx[0:x + 1]) for dirx in justdirectories
                                 if len(dirx) >= x + 1])
        for diry in justdirectoriesii:
            dirw = dirn + fld.FRONTSLASH + fld.FRONTSLASH.join(diry)
            os.mkdir(dirw)
    return subdirornot

def extractfiles(dirx):
    """
    Extract files from 7z files
    in project archive folder.

    Side effect function.
    dirx is a string for the file
    path of the directory to
    be walked.
    """
    subdirornot = makedirectories(dirx)
    dirs = os.walk(dirx)
    # One level, no subfolders.
    files = dirs.next()[2]
    for filex in files:
        # Skip uncompressed split file tracker.
        if filex == fld.SPLITFILETRACKER:
            continue
        s7tx = SevenZipThingExtract(dirx + fld.FRONTSLASH + filex,
                                    dirx, subdirornot[filex])
        s7tx.extractfiles()


def gluetogethersplitfiles(dirx):
    """
    Make split up files whole.

    Side effect function.
    dirx is the folder in which the split
    files reside.
    """
    # Glue together big files.
    # Do this in a very controlling,
    # structured way:
    # 1) Read the split file tracker csv file.
    # 2) Determine the number and names and paths
    #    of files to be reconstructed and the
    #    number of parts in each.
    # 3) Check that everything is there for
    #    each file to be reconstructed.
    # 4) Get the new relative path.
    # 5) Glue back together programmatically.
    splitfiles = []
    # fld.SPLITFILETRACKER is structured as original path
    # of file split, number of file split.
    with open(fld.SAMEFOLDERWIN + dirx +
              fld.FRONTSLASH + fld.SPLITFILETRACKER, 'r') as f:
        for linex in f:
            strippedline = [x.strip() for x in linex.split(fld.UCOMMA)]
            splitfiles.append(tuple(strippedline))
    orignames = [x[0] for x in splitfiles]
    splitoriginals = set(orignames)
    # Make dictionary that is easy to cycle through.
    filesx = {}
    for orig in splitoriginals:
        basedir, basefolder = s7.getbasedir(orig)
        filesx[orig] = {}
        filesx[orig][fld.BASEPATH] = fld.SAMEFOLDER + basedir + basefolder[1:]
        filesx[orig][fld.FILES] = (fld.SPLITFILE.format(filesx[orig][fld.BASEPATH], filex[1])
                for filex in splitfiles if filex[0] == orig)
    for orig in filesx:
        with open(filesx[orig][fld.BASEPATH], fld.WB) as mainfile:
            for filex in filesx[orig][fld.FILES]:
                with open(filex, fld.RB) as splitfile:
                    mainfile.write(splitfile.read())


def restore(dirx):
    """
    Restores MineSight project directory
    inside program path.

    dirx is a string for the directory
    to be restored (./Fwaulu, for example).

    Side effect function.
    """
    extractfiles(dirx)
    gluetogethersplitfiles(dirx)
    print('Done')


restore is the main function for the module (uncompression).
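
The split and glue steps are easier to follow in isolation. What follows is a small round-trip sketch under stated assumptions: the part naming pattern SPLITFILE and the helper names splitfile and gluefile are hypothetical and are not taken from folderstozip or sevenzipper.

```python
# Assumed naming pattern for the parts: <original path>.<part number>
SPLITFILE = '{0}.{1}'


def splitfile(path, chunk):
    """
    Write the file at path out as numbered parts of
    at most chunk bytes each.  Returns the part count.
    """
    count = 0
    with open(path, 'rb') as src:
        while True:
            data = src.read(chunk)
            if not data:
                break
            count += 1
            with open(SPLITFILE.format(path, count), 'wb') as part:
                part.write(data)
    return count


def gluefile(path, count):
    """
    Rebuild the file at path from its numbered parts,
    reading them back in ascending order.
    """
    with open(path, 'wb') as whole:
        for n in range(1, count + 1):
            with open(SPLITFILE.format(path, n), 'rb') as part:
                whole.write(part.read())
```

The part order matters: the tracker file in the module records which parts belong to which original, and the glue step has to read them back in the same sequence they were written.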

Notes:

1) I don't have admin rights at work and did not have javac (the compiler for java) available.  It comes with the JDK (Java Development Kit), which you can download from Oracle; the JRE alone does not include it.  Without admin rights, you can't install it normally.  Still, you can use it.  My compilation went something like this:

<path to downloaded JDK>/bin/javac -cp <path to downloaded 7-ZipJBinding>/lib/* <myclassname>.java

2) I've left all the split up files and 7z archives in the folder where I decompress my files and recombine the split files.  This takes up a lot of space depending on what you're working with.  If space is at a premium, you probably want to write jython code to move or delete the archives after uncompressing them.

3) The most time consuming parts of a run are the compression, the uncompression, and the splitting and recombining of large files.  Porting some of this to java (instead of jython) might speed things up.  I code faster and generally better in jython.  Also, my objective was control, not speed.  YMMV (your mileage may vary) with this approach.  There are far better general purpose ones.
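
For note 2, here is a minimal sketch of clearing archives out of a folder once everything has been uncompressed. The function name removearchives and the '.7z' extension check are assumptions for illustration; it only looks at files directly under the folder and leaves split parts and the tracker file alone.

```python
import os


def removearchives(dirx, extension='.7z'):
    """
    Delete files ending in extension that sit directly
    under dirx.  Returns the names removed, sorted.
    """
    removed = []
    for name in sorted(os.listdir(dirx)):
        pathx = os.path.join(dirx, name)
        # Only touch regular files with the archive extension.
        if os.path.isfile(pathx) and name.lower().endswith(extension):
            os.remove(pathx)
            removed.append(name)
    return removed
```

Running this only after a restore has finished and been checked is the safer order, since the archives are the only copy until then.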

Thanks for stopping by.
