Wednesday, September 19, 2007

The Sapir-Whorf Hypothesis

"When all you have is a hammer, everything looks like a nail." -- Bernard Baruch

In Just In Time Learning, I argued that we should concentrate on learning the abstract rather than the concrete, for example new design patterns instead of new programming languages, because of their broader applicability. Now it may appear that I am going to argue for just the opposite.

One of the best ways I've found to learn new design patterns is to learn new programming languages that depend upon those patterns. One of the problems I have with using C++ to teach object oriented programming is that you don't have to use OOP to use C++. This was by design. It allows C++ to more easily and incrementally penetrate into legacy C code bases. You can use C++ as a "better" C, and ignore all the OO stuff. But this allows you to cheat if you are using C++ to learn OOP. Better to use a language like Java, which forces you to use OOP. Well, I suppose you can cheat in Java too, if you care to make everything static, but it's a heckuva lot more work than to just do it the right way straight off. Learning Java forces you into an OO way of thinking.

This is a variation of the Sapir-Whorf Hypothesis. No, Sapir-Whorf doesn't have anything to do with Klingons, although it may have something to do with the Klingon language. In the early to mid-twentieth century, linguist and anthropologist Edward Sapir and his student Benjamin Whorf formed a hypothesis that suggested that any natural language conditions the thoughts of the people that use it. In other words, not only does human culture influence the language used by that culture, but that the language in turn influences the thoughts of its speakers. If your natural language required that every sentence started with "Please", would children raised in such a society inevitably be more polite?

The hypothesis is essentially untestable, and has hence been controversial. Separating people from language for the purposes of a double blind study is impossible. But as a computer scientist and software developer, I find it hard not to be compelled by it. Because not only to do the design patterns you use influence what programming language you may choose to develop in, the programming language you develop in may influence what design patterns you choose.

In undergraduate and graduate school (this was back in the Dark Ages, you understand), I was lucky enough to have been exposed to a wide range of programming languages. C, much less Java, hadn't even been invented by the time I was an undergraduate. There was a rich culture of experimentation and innovation regarding programming languages, because no one language had really taken root as the "default" language to use for software development the way C and Java have today.

For sure, FORTRAN was was the language of choice (and still is) in scientific computing, and COBOL was the lingua franca for business applications. But every developer back then knew that neither FORTRAN nor COBOL was perfect for the burgeoning backlog of desired software applications, particularly for systems programming. That's why I ended up writing thousands of lines of assembler code (frequently using complex macro packages that made it look like a higher-level language) in the 1970s, and why C was invented at Bell Labs as a kind of structured assembler language. But it also lead to dozens, maybe hundreds, of other programming languages being tossed out into the software development community to see if any of them would stick. Let a thousand flowers bloom; let a thousand schools of thought contend. Except unlike Mao, the computer science community meant it.

Some of these languages were domain specific. Some were designed around novel hardware or operating system architectures. Some tossed the whole idea of procedural programming out the window, leading to whole new techniques for designing algorithms. Learning and programming in these languages lead to a many new paradigms and patterns being imprinted on your brain.

But see, here's the thing. Many of these new schools of thought can be adopted to languages like C, C++, and Java. They can lead to new and better ways to solve problems. Ways that were more efficient, more easily understood, and even more economically maintainable.

Some of the languages I learned in those Dark Ages persist today. Some do not. Some only existed for a short while in the feverish brains of my colleagues. But from each of them I learned something that I might not have learned had I just learned C or Java. Here are some examples.

Prolog. Prolog is a logic programming language. You do not write procedures or methods. You make logical statements that are themselves either true or false, or which can be evaluated as true or false with the insertion of boolean values into variables. A Prolog program might consist of hundreds or even thousands of logical statements, some standalone, some referring to one another. When you run the program, you set the initial conditions of any input variables, and the Prolog interpreter searches the entire solution space described by the Prolog program for a path through the search tree that could succeed. (There may be more than one.) If it exhaustively goes through all possible paths without finding a path that succeeds, the program fails. Of course, there might be a print statement or other side effect along the way.

Here is a Prolog program that was once part of my USENET posting signature. It's not too hard to guess what it means.


Yep, it's a disclaimer: these opinions belong to me; they do not belong to anyone else. Hacker humor. The bang prevents the interpreter from back tracking back through the solution tree to try another path, and the fail causes the program to fail.

Prolog blew my mind because it was such a different approach to writing algorithms. It's syntax was very grammar like (I was really into formal languages and parsing at the time), and I immediately set to work using it to implement a parser for another programming language. When I submitted a sample program in that language to my Prolog program, if the sample program was syntactically correct, the Prolog program succeeded, otherwise it failed. Side effects managed the symbol table and emitted pseudo-machine code.

Lisp. I was never a Lisp hacker, but the List Programming language was my introduction to functional programming style and the idea of representing complex structures as trees and performing transforms on them. Both of these ideas, functional programming style and structure representation and transformation, became part of my every day software design tool kit to show up again and again in the systems I develop. I think if I were to encounter Lisp now, I would appreciate it so much more than I did as an undergraduate, having spent the past thirty years applying its lessons in countless systems and applications.

SNOBOL. SNOBOL is a string processing language with regular expressions. Sure, that seems mundane now, but this predates UNIX and awk. Snobol taught me how to think of strings not as collections of bytes but as malleable forms that could be described and manipulated using complex regular expressions. Years later when I starting using UNIX, writing parsers and data tranformers using awk came naturally, partly because of my prior SNOBOL experience.

Back in the day it was a little game to see how many lines of code it took to write an 80-80 lister. An 80-80 lister was a program that read in punch cards and printed their content to the line printer. Not that you really needed to write an 80-80 lister, the IBM utility IEBGENER worked just fine. But it was just a quick way to evaluate the power of a programming language. An 80-80 lister in SNOBOL looked like this.


It was almost cheating, really.

Simscript. Simscript was an invention of the RAND Corporation, that wacky place that also brought us the Delphi technique of estimation and most of the ideas behind Dr. Strangelove. Simscript is a procedural programming language with support for simulation in the same way that Java supports multithreading: as part of the language itself. I only used Simscript as a student (it is a closed proprietary language, hence hard to get to), but all during my career I found myself wishing I had it around. For some reason I keep encountering the need to write simulations with event streams whose emission rates meet some statistical distribution. What is a few thousand lines of code in C become a few hundred lines of code in Simscript. I learned the basic design patterns of simulation in Simscript as an undergraduate, and I keep applying them in my professional career over and over in other languages.

Forth. Forth is a stack-oriented threaded interpreted language that is still in use today. Here, "threaded" has nothing to do with multithreading, but indicates how the interpretor stores the intermediate form of the program. A Forth program consists of a free form stream of words and numbers. New Forth words can be trivially defined. There is no compiler. Words are parsed and either executed directly or converted into some intermediate form (typically an address) into a dictionary as you type them in. Here is the definition of a simple Forth program that squares its input.


This is a new word called SQUARE. It duplicates the value on the top of the stack so that there are two copies on the stack, then it multiplies those two values together while popping them off the stack, and pushes the result on the stack. The following Forth code squares the value of 12 and prints it out.


A useful Forth interpreter can be implemented in just a few kilobytes, and as such it remains popular in embedded circles. There is an IEEE standard for portable boot firmware that is Forth-based, and to this day you can walk up to many Solaris-based Sun workstations, hit a control sequence on the keyboard, and break into a Forth prompt from an interpreter in the boot ROM. I've embedded a Forth interpreter as a shell inside a C++ application, and have also used Forth's threaded interpretive approach in C and C++ code as a kind of self-modifying code model.

SAS. SAS is another closed, proprietary language that I've been fortunate enough to have access to several times in my career. It's a statistical analysis system, not a language really, but more like a large collection of feature-rich statistical programs with a consistent, high level interface.

I love SAS. Because simulation keeps rearing its ugly head in my work, I keep needing a way to crunch a lot of numbers and produce beautiful charts to summarize it. Spreadsheets don't cut it when you're dealing with gigabytes of data. For SAS, that's all in a days work.

SAS taught me the value of a consistent UI among a toolchain, and how that UI could look like a higher level language. It also taught me how to manage large datasets, something else it necessarily does well.

FP. FP is a pure functional programming language, pure in the sense that it has no concept of variables. It was first described by IBM fellow John Backus, the inventor of the FORTRAN programming language and Backus-Naur syntax, in his Turing Award lecture. It combines some of the ideas behind Kenneth Iverson's APL (another interesting language) with a functional programming style in an effort to liberate software design from the constraints of the traditional von Neumann computer architecture. I never actually used a working FP compiler, but the language turned out to be the partial inspiration of much research during my graduate student years.

Functional programming has informed my software design for the past thirty years, breaking me free early on from the traditional thinking of "compute a value, store the result" and the need to store partial results in variables. The functional programming paradigm maps to Java particularly well, since exceptions can be used to handle errors when using a functional design model.

BAFL. BAFL was another functional language based partly on FP. It existed only in the research group from which my thesis project emerged. What made BAFL really different from other languages, besides being functional instead of procedural, is that it had lazy evaluation. It never did a computation until it was convinced you were actually going to use the result. Lazy evaluation is the interpretive equivalent of the common compiler optimization of not generating code for parts of the program which have no effect.

I remember once we wrote this huge BAFL program to compute some mathematical function or other, turned it loose, and... nothing. We made a couple of corrections, tried it again, and... still nothing. We were looking for the bug when we realized we never told it to do anything with the result, so it never computed it. We added a print statement, and then the interpreter finally executed the huge evaluation tree that it had built.

BADJR. An implementation of BADJR, a language which only existed in the mind of my thesis advisor and mentor Bob Dixon, was my thesis project. In many ways it was the most bizarre programming language I have ever used. It had arbitrary precision fixed point arithmetic, implemented in two's complement base 100 binary coded decimal. It had no language constructs for iteration, instead opting to use efficient tail recursion in which the stack doesn't grow. And it had single-assignment variables: you could assign a variable a value once and only once, after which it was immutable.

When I describe this to people, particularly the single assignment part, a common initial reaction is "this can't work". The idea of single assignment seems more foreign than having no variables at all. But it is simply a very different programming style. The value of a particular variable in a particular stack frame can never be changed, but you can recurse and create a new copy of the same variable in a new stack frame and change that version.

BADJR was sufficiently powerful that another graduate student used it to implement a Prolog interpreter for his thesis project (and he went onto Bell Labs a decade ahead of me -- I always was the slow one). The entire research program from which my thesis emerged was inspired by Japan's Fifth Generation Computer Systems Project, which was to use AI (okay, so now you know it failed), massively parallel computers, and dataflow architectures, to take over the world. BADJR's single assignment architecture mapped well to massively parallel dataflow architectures, and the 5G project had already expressed an intent to leverage Prolog in its AI work. So we thought this was pretty cool stuff.

And it was, in its own way. BADJR pounded recursion as a school of thought into me like no other effort had. It's arbitrary precision arithmetic taught me how multiplication and division actually worked (and led to Mrs. Overclock giving me a copy of Knuth, including the ever-so-critical Seminumerical Algorithms, as a gift). And the two garbage collectors I had to implement for it (reference counting, and mark-and-sweep) helped me appreciate the issues underlying GC in the JVM decades later.

Assembler. Sure, laugh. I learned how and why programming languages work under the hood by writing tens of thousands of lines of IBM BAL and PDP-11 PAL assembler code (and made me a real C zealot as a result). That experience has served me well in my embedded work, when figuring out some of the more esoteric Linux kernel macros, and in fixing the occasional errant DSP application.

But more than that: my early assembler work circa 1976 was my introduction to object oriented design. No, really. The OS/360 I/O subsystem's assembler code introduced me to polymorphism, inheritance, modularity, and encapsulation, and to such OO mechanisms as dynamic binding and virtual tables. When C came along to be the portable assembler language I had always wanted, I used some of these same techniques, admittedly primitively implemented. When C++ finally came along, I breathed a sigh of relief: finally, a language that does all this for me.

Okay, maybe I really wasn't that smart. I'm not sure I really appreciated all those design patterns in the IOS. I just knew a good idea when I saw it, and the first rule of engineering is: steal all the good ideas.

Today I might recommend languages like Ruby and Haskell, although Prolog, Lisp, Simscript, Forth and SAS are still alive and kicking, and still very much worth learning just for the purposes of bending your head around their very different ideas of how to code. And assembler, maybe particularly Intel P4 assembler, still has its place too.

Learning new, and more importantly, radically different, programming languages will make you a better developer even if in your day job all you use is C or Java.

What new and different programming languages would you recommend?


mcjoe said...

Intersting discussion, Chip!

FYI, Ted Neward will be presenting SCALA on the JVM at the Boulder JUG.

Neal Ford has a great presentation on polyglot programming (couldn't find a link to the preso), which is the use of multiple languages in the same project. He and Martin Fowler also have a great talk on Domain Specific Languages, which is very interesting.

As far as languages, I am really enjoying using Groovy, which is a dynamic language built on top of Java and that runs on the JVM.

Chip Overclock said...

I remember the good old days where all I was concerned with was getting some assembler code to work with FORTRAN, COBOL, or (later) C and C++. I once worked (along with a cast of thousands) on a eight million line code base. While eating lunch at my desk one day I decided to write some simple shell scripts to see how many languages were in the code base. I think the answer was something like 23, if you count all the unique configuration file formats too. Every possible UNIX shell was represented, along with lots of tools like awk and perl, maybe some tcl, and of course the usual C and C++. I think there are two competing forces here: the desire to use the right tool for the job, the desire to use the only tool the developer knows, the desire to use the tool that gets the job done quickly so you meet your dates and don't get fired, and the desire to use a new tool whether it's the right tool or not.

Anonymous said...

The second clause of the Prolog program is totally unnecessary.

Chip Overclock said...

Yeah, but it's funnier the way I wrote it. It's art, not code.

Chip Overclock said...

My old friend and colleague David Hemmendinger tells me that C was around when I was first an undergraduate, but it was barely out of Bell Labs at the time. I suspect I was just too ignorant to know of it (and it wasn't being taught where I was going to school).

Snobol4 said...

My apologies for commenting on an an older post, but the content was quite enjoyable to read.

The reference to IEBGENER brought back a flood of memories. It evokes a vague memory of a rather cleverly named module or message that used the letters IGDZLL, referred to as "I,Godzilla".

The paragraph on Forth was especially nostalgic, bringing back memories of a Byte Magazine issue titled "The Forth Language". I recall spending several months developing a Z80 assembly language port of the 8080 Forth interpreter listed on the pages of the magazine. While not a mainstream language, I still find myself contemplating its usage with an embedded controller for an electro-mechanical arcade project.

Thanks for the excellent post!

Chip Overclock said...

Snobol4, you must be OLD!

Apropos of nothing, really, Mrs. Overclock (a.k.a. Dr. Overclock, Medicine Woman) and I are going to be in Seattle towards the end of February to celebrate our 25th wedding anniversary by going to the Science Fiction Museum!