People have invented a number of theories regarding the relative lack of women in the software industry. I don’t intend to opine on each one of those theories. I’m pretty sure that it’s not a lack of ability, because the women who are in software tend to be quite good, and seem to be more capable on average than the men in the industry. I’ve met very good male and female programmers, but the bad ones tend to be men. Now, some people attribute the rarity of female programmers to the macho cultures existing in badly managed firms, and I think that’s part of it, but I also see the negative cultural aspects of this field as resulting from the gender imbalance, rather than the reverse. After all, women had to overcome cultures that were just as bigoted and consumptive (if not more so) as the sort found in the worst software environments in order to break into medicine and law; but eventually they did so, and now they are doing quite well. What’s different about software? I offer a partial explanation: social and industrial dimensionality. Perhaps it is a new explanation, and perhaps it’s a spin on an old one, but let’s dive into it.
In machine learning, dimensionality refers to the number of factors a problem might require us to consider; high dimensionality makes a problem far more complicated and difficult, because it’s usually impossible to tell which dimensions are relevant. Many problems can be framed as attempting to estimate a function (the predicted value, according to the pattern the machine is trained to learn) over an N-dimensional space, where N might be very large. For example, Bayesian spam filtering can be framed as a K-dimensional regression (specifically, a logistic regression) where K is the number of text strings considered words (usually about 50,000) and the inputs are the counts of each.
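To make that concrete, here’s a minimal sketch of a spam filter as a logistic regression over word counts, one dimension per word. The four-word vocabulary, the toy corpus, and the training settings are all my own invented illustrations– a real filter would have tens of thousands of dimensions, which is exactly the point:

```python
import math

# Hypothetical vocabulary: K = 4 dimensions instead of ~50,000.
vocab = ["viagra", "meeting", "winner", "lunch"]

def features(text):
    """Map a message to its vector of word counts over the vocabulary."""
    words = text.lower().split()
    return [words.count(w) for w in vocab]

# Tiny invented labeled corpus: 1 = spam, 0 = ham.
corpus = [
    ("viagra winner winner", 1),
    ("lunch meeting at noon", 0),
    ("winner viagra offer", 1),
    ("meeting about lunch", 0),
]

weights = [0.0] * len(vocab)
bias = 0.0
lr = 0.5  # learning rate

def predict(x):
    """Logistic (sigmoid) model: probability that the message is spam."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 / (1 + math.exp(-z))

# Plain stochastic gradient descent on the log-loss.
for _ in range(200):
    for text, label in corpus:
        x = features(text)
        err = predict(x) - label
        bias -= lr * err
        weights = [w - lr * err * xi for w, xi in zip(weights, x)]

print(predict(features("viagra winner")) > 0.5)  # → True (classified as spam)
```

Each word in the vocabulary is one axis of the space; the filter learns a weight per axis, which is what makes the problem K-dimensional.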
A 10- or even 100-dimensional space is much larger than any human can visualize– we struggle with three or four, and more than five is impossible– but algebraic methods (e.g. linear regression) still work; at a million dimensions, almost all of our traditional learning methods fail us. Straightforward linear algebra can become computationally intractable, even on the most powerful computers available. Also, we have to be much more careful about what we consider to be true signal (worth including in a model), because the probability of false positive findings becomes very high. For some intuition on this, let’s say that we have a million predictors for a binary (yes/no) response (output) variable, and we’re trying to model it as a function of those. What we don’t know is that the data is actually pure noise: there is no pattern, and the predictors have no relationship to the output. However, just by chance, a few thousand of those million meaningless predictors will seem highly correlated with the response. Learning in a space of extremely high dimensionality is an art plagued by false positives; even the conservative mainstay of least-squares linear regression will massively overfit the data (that is, mistake random artifacts in the data for true signal), unless there is a very large amount of it.
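The false-positive effect is easy to demonstrate in a few lines. This is an illustrative sketch, scaled down from the million-predictor example (5,000 predictors, 50 samples) so it runs in seconds; the 0.35 cutoff roughly corresponds to a two-tailed p ≈ 0.01 significance threshold at this sample size. Every predictor is pure noise, yet dozens clear the bar:

```python
import random

random.seed(0)
n_samples, n_predictors = 50, 5000

# A binary response with no relationship to any predictor.
y = [random.choice([0, 1]) for _ in range(n_samples)]

def corr(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

# Count predictors whose correlation with y looks "significant",
# even though every one of them is pure noise.
false_hits = 0
for _ in range(n_predictors):
    x = [random.random() for _ in range(n_samples)]
    if abs(corr(x, y)) > 0.35:
        false_hits += 1

print(false_hits)  # dozens of "significant" predictors, all spurious
```

Scale this up to a million predictors and the few dozen spurious hits become a few thousand, exactly as described above.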
We see this problem (of overfitting in a space of high dimension) in business and technology. We have an extreme scarcity of data. There are perhaps five or six important, feedback-providing events per year, and a much larger number of potentially relevant decisions that lead to them, making it very hard to tell which of those contributed to the result. If 20 decisions were made– choice of programming language, product direction, company location– and the result was business failure, people are quick to conclude that all 20 of those decisions played some role in the failure and were bad, precisely because it’s impossible to tell, in a data-driven way, which of those decisions were responsible. However, it could be that many of those decisions were good ones. Maybe 19 were good and one was terrible. Or, possibly, all 20 decisions were good in isolation but had some interaction that led to catastrophe. It’s impossible to know, given the extreme lack of data.
Dimensionality has another statistical effect: it makes each point an outlier, an “edge case”, or “special” in some way. Let’s start with the example of a circular disk and define “the edge” to mean all points that are more than 0.9 radii from the center. Interior points, generally, aren’t special. Most models assume them to be similar to their neighbors in expected behavior, and their territory is likely to be well-behaved from a modeling perspective. In two dimensions– the circular disk– 81% of the area is in the interior, and 19% is on the edge. Most of the activity is away from the edge. That changes as the number of dimensions grows. For a three-dimensional sphere, those numbers are 72.9% in the interior and 27.1% at the edge. For a 100-dimensional sphere, however, well over 99% of the volume is on the edge, and only 0.9^100, or approximately 0.0027%, is in the interior. At 1,000 dimensions, for all practical purposes, nearly every randomly sampled point will be at the edge. Each data point is unique or “special” insofar as almost no other will be similar to it in every dimension.
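The interior fraction is simple to compute: the volume of a d-dimensional ball scales as r^d, so the share of volume within 0.9 radii of the center is just 0.9^d. A quick sketch reproducing the figures above:

```python
# Volume of a d-dimensional ball scales as r^d, so the fraction of
# volume within `cutoff` radii of the center is simply cutoff**d.

def interior_fraction(d, cutoff=0.9):
    """Fraction of a d-ball's volume lying within `cutoff` radii of the center."""
    return cutoff ** d

for d in (2, 3, 100, 1000):
    f = interior_fraction(d)
    print(f"d = {d:4d}: interior {f:.4%}, edge {1 - f:.4%}")
# d = 2: interior 81%; d = 3: 72.9%; d = 100: ~0.0027%; d = 1000: ~1.7e-44%
```

The exponential decay of 0.9^d is the entire story: each added dimension multiplies the interior share by 0.9, so by a few hundred dimensions essentially nothing is left in the middle.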
Dimensionality comes into play when people are discussing performance at a task, but also social status. What constitutes a good performance versus a bad one? What does a group value, and to what extent? Who is the leader, the second in charge, all the way down to the last? Sometimes, there’s only one dimension of assessment (e.g. a standardized test). Other times, there can be more dimensions than there are people, making it possible for each individual to be superior in at least one. Dimensionality isn’t, I should note, the same thing as subjectivity. For example, figure-skating performance is subjective, but it’s not (in practice) highly dimensional. The judges largely agree on what characterizes a good performance, and differ mainly in their (subjective) assessments of how each individual matched expectations. But there aren’t (to my knowledge, at least) thousands of credible, independent definitions of what makes a good figure skater. Dimensionality invariably begets subjectivity regarding which dimensions are important, i.e. on the matter of what the relative weights for each should be; but not all subjective matters are highly dimensional, just as perceived color is technically subjective but generally held to have only three dimensions (in one model: red, green, and blue).
Social organizations can also have low or high dimensionality. The lowest-dimensional organization is one with a strict linear ordering over the people. There’s a chief, a second-in-command, a third and fourth, all the way down to the bottom. If you’re 8th out of 18 people, you know that you’re not 7th or 9th. Status is determined by one legible fact about a person: the specific assigned rank. Typical hierarchies aren’t that way; while there is a rigid status pyramid, same-rank people are not comparable against one another. In most organizations and groups, leaders are visible, and omegas might be legible because they serve a social purpose, too; but in the middle, it’s almost impossible to tell. Venkatesh Rao goes into a lot of detail on this, but the general rule is that every social group will have a defined alpha and omega, while members 2 through N-1 are incomparable, and the cohesion of the group and the alpha’s positional stability often require this. An independent #2, after all, would represent eventual danger to the chief, which is why proteges are selected only by alphas who plan on graduating to a higher-status group. What is it that keeps social status illegible within the group? Dimensionality. People have been comparing themselves against each other forever; what prevents the emergence of a total linear ordering is the fact that different dimensions of capability or status will produce different rankings, and there’s uncertainty about which matter more.
The group might have one main social status variable, and will usually arrange it so that only one person (or, at scale, a few) have truly elevated status in that dimension, because that’s necessary for stability and morale. Fights over who is #2 vs. #3 are an unwanted distraction. This leaves it to the people within the group to define social status how they will, and the good news for them is that most people can find things or combinations of things at which they’re uniquely good. People find niches. In The Office, people who will never be managers take solace in joining clubs like the Party Planning Committee and the Finer Things Club. People like to become “the <X> guy” (or gal) for some X that makes them important to the group, e.g. “I’m the Friday Cupcake Guy”. It gives each person an additional dimension of social status at which he or she can be an undisputed local chieftain– possibly of territory no one wants, but still the owner of something.
What might this have to do with programming? Well, I’ve often asked (myself, and others) what makes a great programmer, and the conclusion I’ve come to is that it’s very hard, at an individual level, to tell. Speaking broadly, I can say that Clojure (or Lisp) programmers are better than Java programmers, who are in turn better than VB programmers. I know the patterns and the space, and that’s clearly true (in the aggregate). Better programmers prefer more challenging, but ultimately more powerful, languages. But there are excellent Java programmers and bad Lisp hackers. Also, if you bring a crack Clojure or Haskell developer into a typical enterprise Java environment where things are done in a wrong but common way, she’ll struggle, simply because she’s not familiar with the design patterns. Moreover, a person’s reputation in a new job (and, in the long term, status and trajectory) depends heavily on his or her performance in the first few months, during which familiarity with the existing technology practices and tools (the “tech stack”) has more of an impact than general ability. In the short term, it can be very hard to tell who the good and bad programmers are, because so much is project-specific. People are often judged and slotted into the typical software company’s pseudo-meritocracy before sufficient data can be collected about their actual abilities.
Assessing programmers is, to put it simply, a high-dimensional problem. There are more important and possibly powerful technologies out there (and plenty of duds, as well) than there is time to learn even a small fraction of them, and a lot of the know-how is specific to subdomains of “programming” in which one can have a long, fruitful, and deep career. Machine learning requires a dramatically different skill set from compiler design or web development; a top machine-learning engineer might have no idea, for example, how to build a webpage. People in business are judged shallowly (indeed, 95% of success in “business”, in the U.S., is figuring out how to come out on top of others’ shallow– but usually predictably so– judgements) and programming is rarely an exception, so when a person tries something new, there will be “embarrassing” gaps in his or her knowledge, no matter how capable that person is on his or her own territory. There might be 1000 dimensions that one could use to define what a good vs. bad programmer is, and no one excels at all of them.
Given the extreme dimensionality of assessing programmers, I also contend that self-assessment is very difficult. Good programmers don’t always know that they’re good (it can be frustrating and difficult even for the best) and many bad ones certainly think that they’re good. I don’t think that there are huge differences in self-confidence between men and women, individually speaking. Differences between groups are often smaller than those within groups, and I think that this applies to self-efficacy. However, I think the effect of dimensionality is that it can create a powerful feedback loop out of small personal biases over self-efficacy– and I do believe that men are slightly more inclined to overconfidence while women, in the aggregate, are slightly biased in the other direction. Dimensionality gives leverage to these tendencies. A person slightly inclined to view herself as competent will find reasons to believe she’s superior by selecting her strongest dimensions as important; one inclined the other way will emphasize her shortcomings. Dimensionality admits such a large number of ways to define the rules that a person’s view of him- or herself as a competent programmer is extremely flexible and can be volatile. It’s very easy to convince oneself that one is a master of the craft, or simply no good.
When starting out in a software career, women (and men) have to deal with, for just one irritating example, socially undeveloped men who display obnoxious surprise at any unfamiliarity with programming trivia. (“You don’t know what xargs is?”) Women had to deal with such assholes when breaking into medicine and law as well, but there was a difference. The outstanding female doctors, for the most part, knew that they were competent and, often, better than the jerks hazing them. That was made obvious by their superior grades in medical school. In software, newcomers deal with environments where the dimensions of assessment often change, can sometimes (e.g. in “design pattern”-happy Java shops) even be negatively correlated with actual ability, and are far outside of their control. Dimensionality is benevolent insofar as it gives people multiple avenues toward success and excellence, but it can also be used against a person: the dimensions in which the person is weak might be selected as the “important” ones.
A piece of ancient wisdom, sometimes attributed to Eleanor Roosevelt although it seems to pre-date her, is: great minds discuss ideas, middling minds discuss events, and small minds discuss people. This holds strongly in programming, and it relates directly to dimensionality.
Weak-minded programmers are actually the most politically dangerous; they don’t understand fuck-all, so they fall back on gossip about who’s “delivering” and who’s not, hoping to use others’ fear of them to extort their way into credibility, then garner a lateral promotion into a soft role before their incompetence is discovered. As expected, the weak-minded ones discuss people.
The great programmers tend to be more interested in the ideas (concurrency, artificial intelligence, algorithms, mathematical reasoning, technology) than in the tools themselves, which is why they’re often well-versed in many languages and technologies. Through experience, they know that it’s impossible to deal with all of computer science while limited to one language or toolset. Anyway, Hadoop isn’t, on its own, that interesting; distributed programming is. It’s what you can do with these tools, and what it feels like to use them on real problems, that’s interesting. Great minds in programming are more attracted to the fundamentals of what they are doing and how they do it than the transient specifics.
However, most minds are middling, and most programmers are the middling kind. Here, discussion of “events” refers to industry trends and the more parochial trivia. Now, great programmers want to home in on the core ideas behind the technologies they use and generally aren’t interested in measuring themselves against other people. They just want to understand more and get better. Bad programmers (who usually engineer a transition to an important and highly compensated, but not technically demanding, soft role) play politics in a way that is clearly separate from the competence of the people involved; because they are too limited to grapple with abstract ideas, they focus on people, which often serves them surprisingly well. In other words, the strongest minds believe in the competition of ideas, and the eventual consistency thereof, but generally stay away from the messy, ugly process of evaluating people. They shy away from that game, knowing it’s too highly dimensional for anyone to do it adequately. The weak minds, on the other hand, don’t give fuck-all about meritocracy, or might be averse to it, since they aren’t long on merit. They charge into the people-evaluating parts (“Bob’s commit of April 6, 2012 was in direct violation of the Style Guide”) without heeding the dimensionality, because getting these judgments right just isn’t important to them; participating in that discussion is just a means to power.
Middling programmers, however, understand meritocracy as a concept and are trying to figure out who’s worth listening to and who’s not (the “bozo bit”). They genuinely want the most competent people to rise, but they get hung up on superficial details. Oh, this guy used Java at his last job. He must be a moron. Or: he fucking wrote a git commit line over 65 characters. Has he never worked as a programmer before? They get tricked by the low-signal dimensions and spurious correlations, and conclude people to be completely inexperienced, or even outright morons, when their skill sets and stylistic choices don’t match their particular expectations. These middling minds are the ones who get tripped up by dimensionality. Let’s say, for the sake of argument, that a problem domain has exactly 4 relevant concepts. Then there might be 25 roughly equivalent, but superficially different, technologies or methods or resume buzzwords that have been credibly proposed, at some point, as solutions. Each class of mind ends up in a separate space with a different dimensionality. Great minds apprehend the 4 core concepts that really matter and focus on the tradeoffs among those; that means there are 4 dimensions. Low minds (the ones that discuss and focus on people) have a natural affinity for political machinations and dominance-submission narratives, which are primordial and very low in dimensionality (probably 1 or 2 dimensions of status and well-liked-ness). The middling minds, however? Remember that I said there are 25 slightly different tools for which familiarity can be used as a credible assessor of competence, and so we end up with a 25-dimensional space! Of course, those middling minds are no more agile in 25 dimensions than anyone else– we just can’t visualize more than two or three at a given time– which is why they tend to home in on a few of them, resulting in tool zealotry as they promote their local high grounds as the important dimensions.
(“<X> is the only web framework worth using; you’re a moron if you use anything else.”)
I’ve been abstract and theoretical for the most part, but I think I’m hitting on a real problem. The mediocre programmers– who are capable of banging out code, but not insightful enough to be great at it, and far from deserving any credibility in the evaluation of others– are often the most judgmental. These are the ones who cling to tightly-defined Right Ways of, for example, using version control or handling tabs in source code. One who deviates even slightly from their parochial expectations is instantly judged to be incompetent. Since those expectations are the result of emotionally-laden overfitting (“that project failed because Bob insisted on using underscores instead of camelCase!”), they are stances formed essentially from random processes– often pure noise. But as I said before, with a high-dimensional problem it’s easy to mistake sparse-data artifacts (noise) for signal.
In other words, if you go into programming as a career, you’ll probably encounter at least one person who thinks of you as an idiot (and makes the sentiment clear) for no reason other than the fact that specific dimensions of competence (out of thousands of candidate dimensions) that he’s pinned his identity on happen to be the ones in which you’re weak. It’s shitty, it’s “random”, but in the high-dimensional space of software it’s almost guaranteed to happen at least once– especially when you’re starting out and making a lot of genuine mistakes– for everyone. This isn’t gendered. Men and women both deal with this. Obnoxious people, especially early in one’s career, are just an occupational annoyance in software.
Where I think there is a gendered difference is in the willingness to accept that kind of social disapproval. Whether this is innate or a product of culture, I have no idea, and I don’t care to speculate on that. But the people who will tolerate frank social disapproval (e.g. being looked down upon for doing a garage startup instead of having a paying corporate job) for long periods of time seem to be men. I would argue that most people can’t deal with the false starts, confusing and extreme dimensionality, and improbability of fair evaluation in most companies that characterize the software economy. This is becoming even more true as volatile startup methodologies displace the (deeply flawed, but stable) larger corporate regime, and as tools like programming languages and databases diversify and proliferate– a great thing on the whole, but a contributor to dimensionality. People who can go through that wringer, keep their own sense of self-efficacy strong in spite of all the noise, and do all this over the several years that it actually takes to become a highly competent programmer, are very uncommon. Sociopathy is too strong a word for it (although I would argue that many such people fall under the less pejorative “MacLeod Sociopath” category) but it takes an anti-authoritarian “fuck all y’all” flair to not only keep going, but to gain drive and actually heighten performance, amid negative social signals. Like full-on sociopathy, that zero-fucks-itude exists in both genders, but seems to be more common in men. It’s a major part of why most inventors, but also most criminals, are men.
Women have, in general, equal intellectual talents to ours and I would argue that (on average) they have superior abilities in terms of empathy and judgment of character; but they don’t seem as able to tolerate long-term social disapproval. For the record, I don’t mean to imply that this is a virtue of men. Quixotry, also more often a male trait, is the dangerous and often self-destructive flip side of this. Many things bring social disapproval (e.g. producing methamphetamine) because they deserve condemnation. I’m only making an observation, and I may not be right, but I think that the essentially random social disapproval that programmers endure in their early years (and the fact that it is hard, in a massively high-dimensional space, to gain objective proof of performance that might support a person against that) is a major part of what pushes a large number of talented women to leave, or even to avoid getting involved in the first place. I also think that it is dimensionality, especially of the kind that emerges when middling programmers define assessment and parochial trivia become more important than fundamental understanding, that creates that cacophony of random snark and judgment.