Posts Tagged ‘Hackery’

Packing It All In: Distributing Python With an App

Tuesday, May 13th, 2008

Python has lovely built-in distribution tools. They’re great to use if you need a nice, repeatable, easy way to distribute your source code and have it install cleanly on a platform that has its $PATH set up correctly. However, if you want to distribute Python as part of a commercial software package, to platforms that may not even have Python installed, the procedure is not as clean or clear-cut. We devised a way to do it that mostly works, though we have to tweak it somewhat for each release. I’ll show you here our method for doing just that, using the Snakefood program for dependency extraction and a custom script to fill in the gaps that Snakefood can’t quite bridge.

Python is an interpreted language, which means, very basically, that it will not compile down to something that will run natively on any platform. The standard way to get Python to operate is to use the CPython interpreter, a program written in C that reads Python code and performs the actions it describes (called “interpreting” it). There are other options, too, like Jython and IronPython, which do basically the same thing as CPython except that they run the Python code on the Java Virtual Machine and the .NET runtime, respectively. We stick with C. After all, the whole reason we’re doing any of this is that we can’t count on Python being installed. We certainly can’t count on Java or .NET being installed.

As a very basic step one, we need to bundle the CPython interpreter with our app. It’s only about 15MB and is highly compressible, so we can easily include the interpreter, but the standard libraries in Python make for a fairly large installation: the estimated size of Python 2.5.2 is about 180MB. Even if we compress that, it’s still a huge download and a not-so-inconsequential amount of hard drive space. The good news is that we don’t use all of the standard libraries. The even better news is that there’s a pretty simple way of extracting only the files you do need and packaging them into a much smaller distribution. The trick up our sleeve is a small program written in Python called Snakefood. It’s not perfect, but I’ll show ways to get the most out of it.

The first step, of course, is getting Snakefood and installing it. If Python is in your $PATH, just extract the source, then run:

% python setup.py install

from the Snakefood directory, which will install Snakefood to wherever your current Python installation is. You can then run it with:

% python sfood <target file>

from any directory. The target file is the main script of your program. With just that command, it will pull the dependencies from the ‘import’ statements in your main script. That’s probably not good enough, so use the option --follow, which follows all the import statements in each of the imported modules to their leaves. That gets most of what you need.

The output of running Snakefood on a target is not entirely intuitive. It is a list of tuples like the following:

((<source_package_root>, <source_file.py>), (<dest_package_root>, <dest_file.py>))

But sometimes the entry looks like this:

((<source_package_root>, <source_file.py>), (None, None))

It may be tempting, but you can’t skip these lines.

The format of the dependencies tells you that <source_file.py> depends on <dest_file.py>, so you need to preserve it in your pared-down distribution. For us, this is as simple as making a new directory called dist/, and copying the file at path os.path.join(<dest_package_root>, <dest_file.py>) into it. You can make a list of these files directly from the Snakefood output (piped from stdin) with the following script:

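import os
import sys

# Each line of Snakefood output is a tuple literal:
#   ((src_root, src_file), (dest_root, dest_file))
# eval() is tolerable here only because the input is Snakefood's own output.
files = set()
for dep in map(eval, sys.stdin):
    source, dest = dep
    # When the destination is (None, None), keep the source file itself.
    root, filename = dest if dest[0] is not None else source
    files.add(os.path.join(root, filename))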

Now take this set of files and copy them into your new directory. Preserving the directory hierarchy is nontrivial, but not that hard. Hopefully, you have already created a custom Python installation so that all of the relevant files are in one place anyway. From there, you must find the root of the dependency tree. My custom Python installation is at /Users/matthewmoskwa/ExpanDrive/python, so on each path in the file set, I split on 'python' and copy the new path into the dist/ directory (making sure to create new directory nodes first):

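import shutil

for fi in files:
    # Split only on the first 'python' (the installation root) and drop the
    # leading separator; otherwise os.path.join() would discard 'dist'.
    relPath = fi.split("python", 1)[1].lstrip(os.sep)
    distPath = os.path.join('dist', relPath)
    if not os.path.exists(os.path.dirname(distPath)):
        os.makedirs(os.path.dirname(distPath))
    shutil.copy(fi, distPath)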

At this point, the writer of Snakefood claims 99% accuracy. I haven’t measured that claim, but I have found a major drawback: Snakefood misses all __init__.py files, and therefore any import statements in those files. Rather than being smart about it, I just use os.walk() to find all the __init__.py files and copy them into dist/. I then run my code from dist/ and look for ImportErrors. When I see one, I modify my script to manually copy the missing file to dist/. Not perfect, but it works, and it’s still much faster than doing the whole thing by hand.
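A minimal sketch of that os.walk() pass might look like the following. The function name copy_init_files and its two directory arguments are made up for illustration, and it assumes dist_root lives outside src_root so the walk never descends into its own output.

```python
import os
import shutil


def copy_init_files(src_root, dist_root):
    """Copy every __init__.py under src_root to the same relative path under dist_root."""
    copied = []
    for dirpath, dirnames, filenames in os.walk(src_root):
        if '__init__.py' not in filenames:
            continue
        # Rebuild the package directory relative to the installation root.
        rel_dir = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dist_root, rel_dir))
        if not os.path.isdir(target_dir):
            os.makedirs(target_dir)
        shutil.copy(os.path.join(dirpath, '__init__.py'), target_dir)
        copied.append(os.path.join(target_dir, '__init__.py'))
    return copied
```

The returned list is only there to make it easy to see what was added; the copy itself does not depend on it.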

The final step is to compile all of the files down to .pyo and remove all the .py and .pyc files. We use a Python script called compileall.py, located in the standard library, to compile, and then

% find . -type f \( -name '*.py' -o -name '*.pyc' \) -print0 | xargs -0 rm -f

to remove the files. Make sure to run compileall.py under python -OO (the option belongs to the interpreter, not to the script) to get rid of docstrings and other unnecessary stuff.
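For readers on a current Python 3, a rough equivalent of the compile-and-strip step can be done from Python itself. This is a sketch under assumptions rather than the commands used above: compile_and_strip is a made-up name, optimize=2 corresponds to -OO, and legacy=True asks compileall to write each .pyc next to its source instead of into __pycache__ (Python 3.5 and later no longer produce .pyo files).

```python
import compileall
import os


def compile_and_strip(dist_root):
    """Byte-compile dist_root at the -OO level, then delete the .py sources."""
    ok = compileall.compile_dir(dist_root, quiet=1, legacy=True, optimize=2)
    if not ok:
        # Leave the sources alone if anything failed to compile.
        raise RuntimeError('some files failed to compile; sources left in place')
    for dirpath, dirnames, filenames in os.walk(dist_root):
        for name in filenames:
            if name.endswith('.py'):
                os.remove(os.path.join(dirpath, name))
```

Refusing to delete anything after a failed compile is a deliberate choice here: a half-compiled dist/ with its sources gone is much harder to debug.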

Until someone writes an OS in Python or all OSes are guaranteed to have Python installed, this is a pretty good way to distribute Python code to the masses. The next step, actually getting it to run like an application, is up to you, though py2app and py2exe can certainly help.

ExpanDrive Version 1.2

Monday, May 12th, 2008

Fresh off the press, out today, come and get it while it’s hot. Since 1.2 seems to be the magic number, that’s what we’re calling ours too.

Big ticket items: free space remaining now displays correctly on servers that support python. A filter field’s been added to the Drive Manager for those of us that have oh-so-many drives. Public key support is far more robust - in addition, encrypted private keys are also now supported.

Also, you might want to try a little Dino Run.

Finessing international characters out of Python

Tuesday, May 6th, 2008

Whilst we whittled our filesystem problems down to a remaining few and sent our first Release Candidate out into the wild, we discovered we had another specter on the horizon to deal with: International Filename Support. Python generally handles this pretty well: if you receive a UTF-8 string and your terminal expects UTF-8, python will print the correct representation upon your call to “print”. No other work is necessary. This does not go so smoothly if the string you get is not encoded in UTF-8 (or ascii, since it is a true subset of UTF-8). We learned this limitation, and how to overcome it, over the course of two frustrating days.

In our testing, we used another commercial SFTP Client to put some files with international characters in their names onto our test server (to wit: the files were called Québécois and Dvořàk). Unbeknownst to us, the client we used defaulted to Latin-1, aka ISO-8859-1 encoding. However, at this point, we also did not know about encoding in python, so we just output the strings as we received them. What we saw was Qu?b?cois and Dvo??k from the Terminal, and even worse in Finder, Qu? and Dvo? (more on why this was so later).

Python does not auto-detect encodings. You can get some third-party modules to get Python to try and do this.

We knew we had international characters, and we also knew that Mac OS X likes its characters to be encoded as UTF-8 (sort of).

So we tried this:

output_string = input_string.encode('utf-8')

Exception!

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

It looks like python is guessing the string is ASCII. We think it’s UTF-8, so let’s try it again:

output_string = input_string.decode('utf-8').encode('utf-8')

Exception!

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 2-4: invalid data

Oh dear. At this point, I insisted the client we were using was definitely not encoding filenames as UTF-8 data, but Jeff insisted that it had to be (it’s the standard, after all). Then we had an argument about the semantics of decoding vs. encoding. On a whim, I tried decoding the string using ‘latin-1’ as an argument. Ta da! No more Unicode exception! We came to the following conclusion about python encoding/decoding: python converts between encodings by way of an internal, canonical (Unicode) representation. Therefore a string that has not been explicitly decoded is implicitly decoded from ASCII to this form whenever encode() is called on it.

In short, python effectively does this when you call encode() on an incoming string that has not been decoded:

canonical_string = decode(input_string, 'ascii')

output_string = encode(canonical_string, 'ascii')

If the incoming strings are not ASCII-encoded, you must explicitly call decode() on them with the appropriate codec as an argument. Our codec in this case is Latin-1 (aka ISO-8859-1); so far so good.
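As a concrete illustration, here is the decode-then-encode round trip for one of the test filenames. This is a sketch written for Python 3, where undecoded data is a bytes object and nothing is decoded implicitly. The byte literal is an assumed reconstruction of what a Latin-1 client would send for Québécois, not a capture from the actual server.

```python
# Latin-1 bytes for the filename "Québécois" (assumed; 0xE9 is é in Latin-1).
raw_name = b'Qu\xe9b\xe9cois'

# Decoding with the wrong codec fails, much like the UTF-8 attempt above.
try:
    raw_name.decode('utf-8')
    utf8_worked = True
except UnicodeDecodeError:
    utf8_worked = False

# Decoding with the right codec gives text, which can then be re-encoded.
text = raw_name.decode('latin-1')
utf8_name = text.encode('utf-8')
```

Note that Latin-1 assigns a character to every byte value, so decoding with it never raises. That makes it a poor way to detect an encoding, even though it was the right codec here.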

Now that we have our string object, we must call encode() on it with ‘utf-8’ as an argument, since UTF-8 is almost what Mac OS X expects. I say “almost” because there are two ways to represent accented characters in Unicode before they are encoded as UTF-8: “Composed Form” (NFC) and “Decomposed Form” (NFD). The difference is in how characters with diacritics, like à or é, are transmitted. Mac OS X uses decomposed form, which simply means that à is transmitted as two characters, a followed by a combining grave accent, which are then combined for display. Decoding Latin-1 gives us composed form, so before we re-encode the strings as UTF-8, we’ve got to make this switch.

import unicodedata
decomposed_string = unicodedata.normalize('NFD',
    input_string.decode('latin-1'))

Now we can finish up the task.

output_string = decomposed_string.encode('utf-8')

Hooray! We’re done.
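To see what the normalization step actually changes, here is a small Python 3 sketch that compares the two forms of à. The variable names are only for illustration.

```python
import unicodedata

composed = '\u00e0'    # à as a single precomposed code point
decomposed = unicodedata.normalize('NFD', composed)

# NFD splits à into the base letter followed by a combining grave accent.
code_points = [hex(ord(ch)) for ch in decomposed]

# The two forms look identical on screen but encode to different bytes.
composed_bytes = composed.encode('utf-8')
decomposed_bytes = decomposed.encode('utf-8')
```

Here code_points comes out as ['0x61', '0x300'], and the UTF-8 encoding grows from two bytes to three, which is why a filename that looks right can still fail a byte-for-byte comparison.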

But wait… what happens if some other client uses a different encoding? Well, of course the characters will display incorrectly. We need some sort of default encoding that will work. We saw above that using UTF-8 as a default will not work, since there are encodings of characters in latin-1 (and probably other codecs) that are invalid in utf-8. We settled on defaulting to ASCII with the 'replace' error handler. This is acceptable in all cases because of a basic truth about text encoding: every single character is transmitted as at least one byte of data, and with 'replace' every byte ASCII does not define (anything above 0x7F) is swapped for a placeholder instead of raising an exception. So while the character à does not have an encoding in ASCII, its byte sequence, \xc3\xa0, still decodes, though it will usually just print as ?? since both those numbers are greater than 0x7F.
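A short Python 3 sketch of that fallback, using the UTF-8 bytes for à as the example input (the variable names are made up for illustration):

```python
raw = b'\xc3\xa0'    # UTF-8 bytes for à; neither byte is valid ASCII

# Strict ASCII decoding raises on the first byte above 0x7F.
try:
    raw.decode('ascii')
    strict_worked = True
except UnicodeDecodeError:
    strict_worked = False

# With 'replace', each undecodable byte becomes U+FFFD instead of raising.
fallback_text = raw.decode('ascii', 'replace')

# Re-encoding that text as ASCII with 'replace' yields literal question marks.
fallback_ascii = fallback_text.encode('ascii', 'replace')
```

One placeholder appears per undecodable byte, so a single mis-decoded accented character shows up as two question marks, not one.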

Putting it all together, this is basically the function we use to handle these strings.

import unicodedata

def re_encode(input_string, decoder='utf-8', encoder='utf-8'):
    # Decode with the expected codec; fall back to ASCII with replacement
    # characters so that unexpected bytes never raise.
    try:
        decoded = input_string.decode(decoder)
    except UnicodeError:
        decoded = input_string.decode('ascii', 'replace')
    # Mac OS X wants decomposed (NFD) characters.
    return unicodedata.normalize('NFD', decoded).encode(encoder)

And that’s really all there is to it. Python wins the game. By defaulting to ASCII encoding, you won’t get any unhandled exceptions, and you’ll also know pretty quickly that something is wrong (just look for the ???????s). For a much lengthier discussion of what Unicode is and does, see Joel Spolsky’s verbose take on the matter.

Sharing a Virtual Machine between VMWare Workstation and Fusion

Tuesday, October 23rd, 2007

Here is how to share a VM between Windows-based VMWare Workstation and Mac-based Fusion:

  1. Create a large FAT32 partition. You can either carve up your primary hard drive using something like Partition Magic - or do something more sane like buy an external Firewire drive [USB drive performance on OS X is abysmal]. I own this Firelite drive which is powered over Firewire and also this Firewire 800 G-Tech drive. Let me re-emphasize: get a Firewire drive, USB is painfully slow.
  2. Format the drive using Disk Utility with the ‘MS-DOS’ filesystem. Windows, for no apparent reason, refuses to format a FAT32 volume larger than 32GB - so the format must be done in Disk Utility.
  3. FAT32 is limited to 4 GB files, so you’ll need to make sure your virtual disk is split into 2GB segments. It’s easy to specify this option during VM creation or you can convert an existing VM with the command line VMWare disk utilities. I recommend Robert Petruska’s DiskManager GUI, which makes things much easier. I recommend copying the virtual disk to a local drive first; it’ll save a lot of time.
  4. Modify the VM configuration to point at the split disk you just converted, and you’re good to go!

The only real drawback is that Fusion cannot do much [anything] with the tree of snapshots created in Workstation.

Google Calculator from the Command Line

Tuesday, August 21st, 2007

I found this the other day - a command line version of Google Calculator. Very cool!

$ gcalc
gcalc version 0.1 by Greg Miller
Usage: gcalc [-d]
example:  gcalc "5+2*2"
example:  gcalc 5!
example:  gcalc "sqrt(-4)"
example:  gcalc "160 pounds * 4000 feet in calories"
example:  gcalc avogadros number
example:  gcalc 0b110111010 + 0x33 in decimal
example:  gcalc 22 lira in yen
example:  gcalc 2 to the power of 5

VMWare Fusion high CPU usage hint

Thursday, August 16th, 2007

Whether iTunes or VMWare Fusion v1.0 is the culprit, I’m not sure. I’ve been getting high CPU usage while my guest OS is idle in VMWare Fusion 1.0 final. The solution, noted elsewhere, is to disable ituneshelper.exe using autoruns or msconfig.exe. vmware-vmx CPU usage with an idle guest OS went from 38% to 8%.

Coconut Wifi - Airport Menubar replacement

Wednesday, July 25th, 2007

This is what I’ve been looking for! I’ve complained before that the default Airport dropdown is hopelessly inadequate if you’re looking to discover an open access point, or select one that has the strongest signal. Thankfully, this guy went and made it happen. Awesome.

Here is a screenshot from their website that shows you what’s up:

coconutwifi1.jpg

Parallels to VMWare disk image conversion

Friday, July 13th, 2007

I’ve gotten a few questions regarding my Switch to VMWare Fusion post - namely, how do you go about converting your existing Parallels virtual disk so it’ll run inside VMWare Fusion.

Unexpected answer - VMWare Converter. This free tool is designed to convert a physical machine into a VMWare format virtual machine. Nothing says you can’t have it convert a virtual machine; usually, it just doesn’t make much sense to.

You’ll need to either install VMWare Converter inside your Parallels VM or do a “remote” connection to it, set a few configuration options and then let ’er rip. Hope you’ve got some extra hard drive space; you’ll need room to store the additional copy of your virtual hard drive while the conversion is being performed. Enjoy.

UPDATE: VMWare has some detailed instructions on this process in their forums

VMWare Internet Connection Sharing appliance

Friday, June 22nd, 2007

I admit, I have never quite understood the push behind these VMWare appliances. For me, they fall into the giant soup of enterprise products that I can’t imagine ever using. That being said, I’ve finally found one that is quite useful for me, the non-enterprise user.

Supposedly simple task: Share the wireless connection of my IBM Thinkpad with my MacBook Pro.

Windows Internet Connection Sharing [ICS] has proven super flakey and slow, not to mention its complete lack of advanced options. After 45 minutes of pain, I gave this VMWare appliance a shot. I set up VMNet1 to run NAT and DHCP against the host and VMNet2 to bridge with the ethernet connection. Set the VM to boot when the Thinkpad powers up. 32 Meg memory footprint. Good to go.

It just works, and the performance is fantastic.

Awesome.

I’ll also use this post to give a shout out to the best-free-product-in-the-universe, VMWare GSX Server. It really is quite amazing. It has nearly all the power of VMWare Workstation, and has some extra cool features of its own. Did I mention it is completely free?