Backdoor to Machine Learning

In early 2012, a Japanese video game about shooting up robots came with a highly advertised voice recognition system: you could command your computer-controlled teammates by shouting short phrases like "Charge!" or "Fall back!" While the game gave me the quintessential 2012 shooter experience, the voice recognition was quite terrible, especially for people with accents and bad microphones. Perhaps the problem was in the training data and the algorithms; at the time, voice recognition simply was not very good.

Throughout the rest of 2012, deep learning became a Big Deal. The gradually maturing theory of multilayer neural networks met abundant data and fast computers. In a series of papers by different groups, deep learning methods outperformed traditional methods on many benchmark and novel tasks, from handwritten digits to voice recognition. In the coming years, enrollment in machine learning and artificial intelligence courses would go through the roof, and "data scientist" would emerge as a legitimate and arguably sexy new job.

When I arrived in the US in the fall of 2012, at the physics department of a premier university, physics was distracted. The Higgs boson discovery at the LHC had been announced just that summer. Some people were telling stories about an earlier sneak peek at the data; others were wondering what other Higgs parameters could be extracted from it, or which more general theories we could now test that predicted the same Higgs mass but came with other bells and whistles. Deep learning lived in some other department, and I kept carefully dodging all machine learning classes in favor of statistical mechanics and networks. Until ten years later, when in the fall of 2022 I found myself working in a place called the AI Institute.

Machine learning is predictive optimization. It assumes that the data is described by some big function with many parameters, and uses the dataset to fine-tune the function for maximal predictive power. I have written before about my issues with optimization as an intellectual framework. In short, optimization gets so obsessed with finding THE BEST solution that it forgets to interrogate the criteria that make solutions meaningfully different.
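
To make that framing concrete, here is a minimal sketch of "fit a big parametric function to data": a toy polynomial model tuned by gradient descent on synthetic data I made up for illustration, not any particular method from my work.

```python
import numpy as np

# Toy illustration of "predictive optimization": a parametric function
# (here a cubic polynomial, standing in for a "big function with many
# parameters") is tuned by gradient descent to minimize prediction error
# on synthetic data.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 200)
y = np.sin(3 * x) + 0.1 * rng.normal(size=x.size)   # the "data"

degree = 3
theta = np.zeros(degree + 1)                        # model parameters
X = np.vander(x, degree + 1)                        # feature matrix of powers of x

lr = 0.1
for step in range(5000):
    residual = X @ theta - y                        # prediction errors
    grad = 2 * X.T @ residual / x.size              # gradient of the mean squared error
    theta -= lr * grad                              # fine-tune the parameters

print("fitted parameters:", theta)
print("mean squared error:", np.mean((X @ theta - y) ** 2))
```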

Looking from the outside, for years I could not understand why machine learning was necessary, why you couldn't weasel your way out of a problem with a clever choice of statistical tests, symmetry arguments, and old-fashioned theory. Soft matter physicists studying disordered glassy systems came up with a large zoo of structure functions to describe particle configurations, and trained machine learning models to predict particle rearrangements. I was busy learning combinatorics.

Observing my physics colleagues take off-the-shelf machine learning methods to analyze their data led to the first constraint I set for myself: I want my models to be "interpretable". The notion of interpretation is very slippery across machine learning and has numerous definitions. Interpretation is why my first Systems Physics paper was rewritten top to bottom several times, even though it is based on a stylized model and doesn't have an ounce of empirical data. People asked me why I didn't use "machine learning" for that work, before any of us knew what machine learning meant. Machine learning was associated with training "black boxes" and then "opening" them to look at the hidden layers. If it's that involved, why would I need black boxes in the first place?

In the meantime, a different black box served me news feeds and cat videos. Social media giants have all the data, all the models and compute power, and no qualms about interpretability. They care about growth and profit, which requires maximizing engagement to the detriment of platform users. YouTubers try to appease the all-powerful Algorithm that decides whether their videos will be shown and monetized, or be labeled "explicit content" for any mention of LGBTQ+ issues because the Algorithm's utopia just can't imagine LGBTQ+ people. The deployed algorithms don't even need to create content, as they can just be the arbiters of user-created content. On a societal level, this corporate-algorithmic governance might spell trouble for us, unless we figure out some form of collective stewardship of the media. On a personal level, this crystallized the second constraint: I want my work to be "ethical". I just needed to find out what that means.

One systematic ethics framework that I found useful is Data Feminism, laid out in the eponymous book. On one side, we have the quantitative algorithms and numerical data that they crunch. On the other side, we have feminist epistemology that guides the practitioner to examine the power structures in the data-driven study, check the data provenance, give labor credits, and consider the impacts of model deployment.

One key idea of feminist epistemology is standpoint theory, or the recognition that the picture of the world each of us has access to is the product of our social position. Everyone has a standpoint; nobody is looking from above. Some people don't have access to data and the skills to analyze it, while others have that access through their social position. Data access is thus a question of power dynamics in relation to the process the data describes. Data is a newly empowered form of capital, mutually reinforcing with the familiar social, financial, and political capitals.

Having access to data within the scope of a particular project thus gains some sanctity: you were either given the resources to gather the data (ranging from running massive supercomputer simulations to staying at a remote field site), or given the trust and goodwill of the people who gathered it. We also know that any raw data is a "diamond in the rough": it has the potential to be useful, but only after it is "cleaned". Cleaning is, in many eyes, the domain of lower-paid, less-necessary people, the digital janitors. That is, the very people without whom we would not be able to run our fancy algorithms. Without janitors and custodians, our offices would be overflowing with the filth of our own production.

A common question somewhere between a technical interview and a data bro sales pitch is: "If I give you a spreadsheet of 10k lines, what tools would you first pull out to tell me something about this dataset? Okay, what if it were 100k lines?" I get it, there is probably a fancy tool that runs quick off-the-shelf visualizations and summary statistics on the dataset. But my first questions would be different. Who gathered this dataset? What questions were they asking with this data? How did you get your hands on it? Questioning data provenance, metadata, and data sheets is what allows us to stay grounded in the research. If you are not allowed to touch some data, or if such data is not being gathered at all, perhaps there is a social rather than technical reason at play, one that needs to be addressed first, before drowning interviewees in spreadsheets.

Bragging about 100k-line spreadsheets also serves to reassert the supremacy of machine learning methods, the proud rejection of domain knowledge, the irrelevance of all the work that came before you. Papers proclaiming the "End of Theory" in the age of machine learning just try to build credence for the authors' takeover of a scientific domain. At the same time, careful statisticians argue that big data increases your confidence in the answers (hey, paradise!) while making those answers spectacularly wrong (wow, paradox?). For badly sampled data, the error doesn't decrease with the square root of the sample size, but grows with the square root of the population size! Try to end that theory!
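
A toy simulation of that statistical argument, with a self-selection bias I invented purely for illustration: as the biased sample grows, the naive error bar shrinks while the estimate stays confidently wrong.

```python
import numpy as np

# Toy version of the big-data paradox: with a self-selected (biased) sample,
# more data sharpens confidence around the wrong answer instead of fixing it.
rng = np.random.default_rng(1)

N = 1_000_000                                   # population size
population = rng.normal(loc=0.0, scale=1.0, size=N)
true_mean = population.mean()

# Assumed response mechanism (purely illustrative): people with larger
# values are more likely to end up in the dataset.
response_prob = 1 / (1 + np.exp(-population))
weights = response_prob / response_prob.sum()

for n in [1_000, 10_000, 100_000]:
    # Draw a biased sample of size n, weighted by the response probability.
    idx = rng.choice(N, size=n, replace=True, p=weights)
    sample_mean = population[idx].mean()
    naive_se = population[idx].std() / np.sqrt(n)   # the error bar you'd naively report
    print(f"n={n:>7}: estimate={sample_mean:+.3f} ± {naive_se:.3f}, "
          f"true mean={true_mean:+.3f}")
```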

Access to data, labor credit, power dynamics, and domain expertise are not questions that aim to slow down data science "progress": they try to make sure data is used "for good". The handy color-coded maps that guided mortgage lending across many US cities throughout the 20th century were an example of a data-driven method guiding policy, except that they were not used for good. They were used for redlining: the systematic denial of home ownership and wealth building to Black Americans. There are many other examples of data science and AI/ML being used for evil. If you need to flaunt your credentials to shield a colleague with whom you work on predictive policing, maybe the answer is not in math and data.

So why did I end up at the AI Institute? How can I touch machine learning and not break my two rules? I found a research space working on machine learning for engineering: not the software kind, but the old-fashioned kind. Across fluid flows, elastic deformations, predator-prey oscillations, and other physical systems, we often obtain data before we ask questions, without the ethical qualms of human-subject data. From this data, we want to find the correct governing equations, and the correct reduced coordinates in which the equations look the nicest.
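
As a rough sketch of what "finding governing equations from data" can look like, here is a sparse-regression toy in the spirit of the SINDy line of work: simulate a predator-prey system, then try to recover its equations from the trajectory alone. The library of candidate terms and the threshold are arbitrary choices for this illustration, not a recipe from my research.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Sketch: recover predator-prey equations from trajectory data by sparse
# regression over a library of candidate terms.
def lotka_volterra(t, z, a=1.0, b=0.5, c=0.5, d=2.0):
    x, y = z
    return [a * x - b * x * y, c * x * y - d * y]

t = np.linspace(0, 20, 2000)
sol = solve_ivp(lotka_volterra, (0, 20), [3.0, 1.0], t_eval=t)
X = sol.y.T                                    # states: prey x, predator y
dX = np.gradient(X, t, axis=0)                 # numerical time derivatives

# Library of candidate terms: 1, x, y, x^2, x*y, y^2
names = ["1", "x", "y", "x^2", "x*y", "y^2"]
Theta = np.column_stack([np.ones(len(t)), X[:, 0], X[:, 1],
                         X[:, 0]**2, X[:, 0] * X[:, 1], X[:, 1]**2])

# Sequentially thresholded least squares: fit, zero out small coefficients, refit.
coeffs = np.linalg.lstsq(Theta, dX, rcond=None)[0]
for _ in range(10):
    small = np.abs(coeffs) < 0.1
    coeffs[small] = 0.0
    for k in range(dX.shape[1]):
        big = ~small[:, k]
        coeffs[big, k] = np.linalg.lstsq(Theta[:, big], dX[:, k], rcond=None)[0]

for k, var in enumerate(["dx/dt", "dy/dt"]):
    terms = [f"{coeffs[j, k]:+.2f} {names[j]}" for j in range(len(names)) if coeffs[j, k] != 0]
    print(var, "=", " ".join(terms))
```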

My question is: how do we identify model structures that don't require immediate retraining? How do we find generality and interpretability in a model that already has predictive power? The under-the-hood mathematics puts model parameters and datasets on equal footing. In statistical mechanics terms, the data is like an external field acting on the model parameters, one that may or may not condense the model into a particular state. And with statistical mechanics, we can swim in the datasets and argue about their broad properties.
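
One standard way to make that analogy concrete (a generic Bayesian/Gibbs correspondence, not a formula specific to my work) is to write the distribution over model parameters as a Boltzmann weight whose energy is the loss evaluated on the data:

```latex
% Parameters \theta feel the data D only through the loss \mathcal{L},
% much like spins feel an external field through the energy.
P(\theta \mid D) = \frac{1}{Z(D)} \exp\!\left[ -\beta \, \mathcal{L}(\theta; D) \right],
\qquad
Z(D) = \int \exp\!\left[ -\beta \, \mathcal{L}(\theta; D) \right] d\theta .
```

Whether this "field" is strong enough to condense the parameters into one well-defined state, or leaves them wandering among many, is exactly the kind of question statistical mechanics is built to answer.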

Of course, I found my own backdoor into the machine learning space. There are many other kinds of backdoors. And if you are fully committed to learning this stuff and have nothing to lose, the front door is wide open. Read a book, take a free course, watch some Eigensteve lectures. Machine learning ideas are useful to keep in your arsenal, but they will not give you an epistemology, or override the domain expertise of every single branch of science.

This would have been the happy ending of the story if not for the Large Language Model invasion of the last few months. The same people who usually work on careful and interpretable models are suddenly happy to take sightseeing advice from an oracle, and to learn pseudo-authentic world history from a high-volume bullshit-spewing machine. The Ethical AI team at Google and their colleagues saw all these problems two years ago and wrote a famous exposé paper. That is why Google does not have an Ethical AI team anymore. That is why the development and hype of LLMs are built, at their core, on anti-ethical foundations. Anything goes to reach their long-term goal. Perhaps, in our immediate future, dealing with AI will again require shooting at robots.
