pyright: Downloading a Bunch of MP3's off the Internet (Foreign Language Tapes)

A mining bud, Jen, wrote a blog post lamenting the difficulty of learning a foreign language as an adult in a far-off land. This inspired me to clean up and publish a script I wrote for myself to download the Foreign Service Institute French "tapes" (mp3's, actually).

I'm not very astute on web programming. This script came out of necessity, and there may be other, more efficient ways to do this. If you have a slow connection, a piecemeal approach will probably be required. It took about 20 minutes to get all these files over a decent Verizon MIFI unit connection (unfortunately, I don't have speed metrics available).
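
For the piecemeal approach, here is a rough sketch of one way to do it: skip any file that is already on disk, so the script can be stopped and rerun until everything has arrived. The fetch_missing name and the '.part' suffix are my own inventions for illustration, not part of the script further down, and I have not tried this over a genuinely slow connection.

```python
import os
from urllib import request


def fetch_missing(jobs, retrieve=request.urlretrieve):
    """Download each (url, filename) pair unless the file already exists.

    Returns the list of filenames fetched on this run.
    """
    fetched = []
    for url, filename in jobs:
        # Skip anything a previous run already saved.
        if os.path.exists(filename) and os.path.getsize(filename) > 0:
            continue
        # Download to a temporary name first so an interrupted transfer
        # is not mistaken for a finished file on the next run.
        partial = filename + '.part'
        retrieve(url, partial)
        os.replace(partial, filename)
        fetched.append(filename)
    return fetched
```

The idea would be to build a list of (url, filename) pairs and hand it to fetch_missing, rerunning after any dropped connection.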

Notes about the downloaded product: the US State Department's language tapes and lessons were mostly written and produced 30 to 50 years ago. It's not Rosetta Stone, but I have found them to have value when it comes to practicing pronunciation, including cadence and rhythm of the foreign language - things you just can't get from printed or displayed text.

My late wife gifted me some Spanish tapes prior to the internet age that helped me out. I am by no means fluent in Spanish, but I can say Hacemos lo que podemos hasta que nos boten (this may not be entirely grammatically correct) to the Spanish speaking mining engineers and get a laugh.

The original names of the mp3's are unnecessarily long and have the appearance of having been created by the Department of Redundancy Department. It's a government thing, but it does not reflect on the quality of the product. While the tapes at times are sociologically and technologically dated in their subject matter, the foreign languages haven't changed all that much.

The script: I used Python 3.4 with the urlretrieve function from the urllib package's request module. The main challenge was getting the url's of the mp3's right. The names are not entirely consistent. For help with this (I am using Firefox 24.3.0 on OpenBSD 5.4), I right-clicked on the mp3's link and selected Inspect Element from the drop-down menu.

The lower left window has the href and the link to the mp3 - if your script is not able to find the file, this is a convenient place to look.
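
If clicking through the inspector for every link gets tedious, the same href values can in principle be pulled out of the page source with the standard library's html.parser. This is only a sketch under assumptions: I am assuming the mp3's are ordinary anchor tags with an href ending in .mp3, and Mp3LinkParser and mp3_links are hypothetical names of my own.

```python
from html.parser import HTMLParser


class Mp3LinkParser(HTMLParser):
    """Collect the href of every <a> tag that points at an mp3."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value and value.lower().endswith('.mp3'):
                self.links.append(value)


def mp3_links(html_text):
    """Return the mp3 hrefs found in a string of HTML, in page order."""
    parser = Mp3LinkParser()
    parser.feed(html_text)
    return parser.links
```

Feeding it the text of a saved copy of the course page should give a list of links to compare against what the script generates.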

This is the whole thing:

#!python3.4

from urllib import request

# For getting foreign language study mp3's.
# Main part of URL for French.
BASEURL = 'http://www.fsi-language-courses.org/Courses/'
MIDDLEURLI = 'French/Basic (Revised)/Volume {volume}/'
MIDDLEURLII = 'French/Basic (Revised)/Volume {0:s}/'
BASEURLEND = 'FSI - French Basic Course (Revised) '

# Format changes inexplicably at chapter 19.
# Grrrr . . .
URLI = BASEURL + MIDDLEURLI + BASEURLEND
URLI += '- Volume {volume} - Unit {unit:0>2d} '
URLI += '{unit:0>2d}.{section:0>2d}.mp3'

URLII = BASEURL + MIDDLEURLII + BASEURLEND
URLII += '- Volume {1[volume]:d} - Unit {1[unit]:0>2d} '
URLII += '{1[unit]:0>2d}.{1[section]:d}.mp3'

# Format for actual name of mp3 files.
# This is what I wanted for a name - your
# preferences may be different - adjust
# accordingly.
FILENAME = '{unit:0>2d}{section:0>2d}.mp3'

# Texts (pdf format).
# Everything the State Dept. does is a 'StudentText' -
# fair enough.
STUDENTTXT = 'StudentText.pdf'

PDFURLBASICTEXT1 = 'http://ia601400.us.archive.org/28/items/'
PDFURLBASICTEXT1 += 'Fsi-FrenchBasicCourserevised-StudentText/'
PDFURLBASICTEXT1 += 'Fsi-FrenchBasicCourserevised-Volume1-'

PDFURLBASICTEXT2 = 'http://ia801400.us.archive.org/28/items/'
PDFURLBASICTEXT2 += 'Fsi-FrenchBasicCourserevised-StudentText/'
PDFURLBASICTEXT2 += 'Fsi-FrenchBasicCourserevised-Volume2-'

PDFURLMONDEFR = 'http://ia600406.us.archive.org/3/items/'
PDFURLMONDEFR += 'Fsi-LeMondeFrancophone/Fsi-LeMondeFrancophone-'

TWO = 'Two'

# Tack on StudentText.pdf to end.
pdfs = [PDFURLBASICTEXT1, PDFURLBASICTEXT2, PDFURLMONDEFR]
pdfs = [pdfx + STUDENTTXT for pdfx in pdfs]
myfilenames = ['basictext1.pdf', 'basictext2.pdf', 'mondefrancophone.pdf']
# I'm using the dictionary keys for filenames.
pdfs = dict(zip(myfilenames, pdfs))

VOLUME = 'volume'
UNIT = 'unit'
SECTION = 'section'

# Volume number is the key; each value is a list of
# (unit, number of sections) tuples.
VOLUMES = {1:[(1, 6), (2, 6), (3, 6), (4, 7), (5, 7),
              (6, 3), (7, 11), (8, 10), (9, 11), (10, 9),
              (11, 9), (12, 4)],
           2:[(13, 8), (14, 9), (15, 10), (16, 9), (17, 11),
              (18, 7), (19, 9), (20, 8), (21, 8), (22, 7),
              (23, 8), (24, 6)]}

mp3s = []
for key in VOLUMES:
    for unitsection in VOLUMES[key]:
        for x in range(1, unitsection[1] + 1):
            mp3s.append({VOLUME:key, UNIT:unitsection[0], SECTION:x})

for mp3x in mp3s:
    # Name format change at chapter 19 :-(
    if mp3x[UNIT] > 18:
        urlx = URLII.format(TWO, mp3x)
    else:
        urlx = URLI.format(**mp3x)
    filenamex = FILENAME.format(**mp3x)
    print('Retrieving {0} . . .'.format(urlx))
    request.urlretrieve(urlx, filenamex)

# Add pdf texts at end.
for pdfx in pdfs:
    print('Retrieving {0} . . .'.format(pdfx))
    request.urlretrieve(pdfs[pdfx], pdfx)

print('Everything appears to have downloaded.')
print('Check the directory with the files to be sure.')
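
One caveat for anyone running this on a newer Python: the mp3 URLs contain spaces and parentheses, and, as far as I can tell, Python 3.8 and later refuse to send a request whose URL contains a space. A minimal sketch of one workaround, assuming the server accepts percent-encoded paths (I have not tested this against the FSI site); quote_url is a made-up helper name, not part of the script:

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_url(url):
    """Percent-encode the path of a URL, leaving scheme and host alone."""
    parts = urlsplit(url)
    # quote() keeps '/' as-is by default and encodes spaces,
    # parentheses, and other reserved characters.
    return urlunsplit(parts._replace(path=quote(parts.path)))
```

With that in place, the download line would become request.urlretrieve(quote_url(urlx), filenamex). It should not be run over a URL that is already percent-encoded, or the % signs get encoded a second time.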

As for my French efforts, I've had better luck downloading this stuff than I have learning it. Nonetheless, a quick message to Guido van Rossum and the other core devs: transmettez-leur mon meilleur souvenir.

Sunday, October 12, 2014
