In response to listener feedback, we’re delving a bit deeper into the topic of
artificial intelligence and machine learning, and their impact on society and our political structures.
On the second episode of The Private Citizen for this week, I turn to some very thoughtful producer feedback that I’ve recently received.
I will be very busy soon, so I won’t be able to release an episode in the coming week. But I am planning to release two episodes in the week following this short hiatus.
Feedback on My Coverage of AI
Bennett wrote me this nice email:
Sorry that I haven’t written in a while. There’s been a lot going on at work and I’m currently buying an apartment, so I took a break from listening for about half a year. I thoroughly enjoyed episodes 129 and 130, but I agreed too much to bother writing in.
I wanted to write in about episode 131, because I thought you were a little unclear on some definitions. Note that I don’t consider myself an expert on machine learning, but I did take some graduate-level courses and my dad is an expert in the field.
I think this is a very good idea! I am, of course, also far from an expert in the field and I appreciate the insight. Which is why I thought I’d make Bennett’s feedback the centre of its own episode.
You are correct that no one in research uses the term “AI”, that’s a marketing term which can be applied to anything. The insider term for the interesting stuff (as opposed to programming) is “machine learning”. There are actually many different approaches here. If you read a few paragraphs of the Wikipedia articles in the following list, you’ll have a good overview:
→ Support Vector Machines
In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier (although methods such as Platt scaling exist to use SVM in a probabilistic classification setting). SVM maps training examples to points in space so as to maximise the width of the gap between the two categories. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.
In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.
When data are unlabelled, supervised learning is not possible, and an unsupervised learning approach is required, which attempts to find natural clustering of the data to groups, and then map new data to these formed groups. The support vector clustering algorithm, created by Hava Siegelmann and Vladimir Vapnik, applies the statistics of support vectors, developed in the support vector machines algorithm, to categorize unlabeled data.
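To make the classification idea concrete, here is a minimal linear SVM trained with subgradient descent on the hinge loss. Everything in this sketch (the toy data, learning rate and regularisation constant) is invented for illustration; a real project would reach for a library such as scikit-learn.

```python
# Minimal linear SVM via subgradient descent on the hinge loss.
# Illustrative sketch only; the data and hyperparameters are made up.

def train_svm(points, labels, lr=0.1, lam=0.01, epochs=200):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (w[0] * x[0] + w[1] * x[1] + b)
            if margin < 1:  # point inside the margin: push the hyperplane away
                w[0] += lr * (y * x[0] - lam * w[0])
                w[1] += lr * (y * x[1] - lam * w[1])
                b += lr * y
            else:           # correctly classified: only apply regularisation
                w[0] -= lr * lam * w[0]
                w[1] -= lr * lam * w[1]
    return w, b

def predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1

# Two linearly separable toy clusters, labelled +1 and -1.
pts = [(2, 2), (3, 3), (2, 3), (-2, -1), (-3, -2), (-1, -2)]
lbl = [1, 1, 1, -1, -1, -1]
w, b = train_svm(pts, lbl)
print([predict(w, b, p) for p in pts])   # → [1, 1, 1, -1, -1, -1]
```

The learned hyperplane separates the two clusters; new points are classified by which side of it they fall on, exactly as the excerpt describes.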
→ Linear Regression
In statistics, linear regression is a linear approach for modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables). The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.
In linear regression, the relationships are modeled using linear predictor functions whose unknown model parameters are estimated from the data. Such models are called linear models. Most commonly, the conditional mean of the response given the values of the explanatory variables (or predictors) is assumed to be an affine function of those values; less commonly, the conditional median or some other quantile is used. Like all forms of regression analysis, linear regression focuses on the conditional probability distribution of the response given the values of the predictors, rather than on the joint probability distribution of all of these variables, which is the domain of multivariate analysis.
Linear regression was the first type of regression analysis to be studied rigorously, and to be used extensively in practical applications. This is because models which depend linearly on their unknown parameters are easier to fit than models which are non-linearly related to their parameters and because the statistical properties of the resulting estimators are easier to determine.
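For the one-explanatory-variable case, the least-squares fit has a closed-form solution, sketched here on an invented toy data set:

```python
# Simple linear regression via the closed-form least-squares solution;
# a sketch of the idea, not production code.

def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
            / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]          # exactly y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)       # → 2.0 1.0
```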
→ Decision Trees (these you can imagine as “nested if-statements”)
A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.
Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal, but are also a popular tool in machine learning.
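The "nested if-statements" view can be shown quite literally. This hypothetical tree (the criteria are invented for illustration) decides whether to cycle to work:

```python
# A decision tree really is just nested if-statements.
# Hypothetical decision criteria, purely for illustration.

def should_cycle(raining: bool, distance_km: float, have_bike: bool) -> bool:
    if not have_bike:
        return False
    if raining:
        return distance_km < 2   # only cycle short trips in the rain
    return distance_km < 15      # otherwise cycle anything reasonable

print(should_cycle(raining=False, distance_km=5, have_bike=True))   # → True
print(should_cycle(raining=True, distance_km=5, have_bike=True))    # → False
```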
→ k-Nearest Neighbors
In statistics, the k-nearest neighbors algorithm (k-NN) is a non-parametric supervised learning method first developed by Evelyn Fix and Joseph Hodges in 1951, and later expanded by Thomas Cover. It is used for classification and regression. In both cases, the input consists of the k closest training examples in a data set. The output depends on whether k-NN is used for classification or regression.
k-NN is a type of classification where the function is only approximated locally and all computation is deferred until function evaluation. Since this algorithm relies on distance for classification, if the features represent different physical units or come in vastly different scales then normalizing the training data can improve its accuracy dramatically.
Both for classification and regression, a useful technique can be to assign weights to the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. For example, a common weighting scheme consists in giving each neighbor a weight of 1/d, where d is the distance to the neighbor.
The neighbors are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required.
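A sketch of k-NN classification with the 1/d weighting scheme described above, on invented toy data:

```python
# k-nearest-neighbours classification with 1/d distance weighting.
# Toy data invented for illustration; no training step, we just store points.
from math import dist
from collections import defaultdict

def knn_predict(train, query, k=3):
    # train: list of ((x, y), label) pairs
    neighbours = sorted(train, key=lambda p: dist(p[0], query))[:k]
    votes = defaultdict(float)
    for point, label in neighbours:
        d = dist(point, query)
        votes[label] += 1 / d if d > 0 else float("inf")  # weight 1/d
    return max(votes, key=votes.get)

train = [((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
         ((8, 8), "blue"), ((8, 9), "blue"), ((9, 8), "blue")]
print(knn_predict(train, (2, 2)))   # → red
print(knn_predict(train, (8, 7)))   # → blue
```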
→ k-Means Clustering
k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. k-means clustering minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. For instance, better Euclidean solutions can be found using k-medians and k-medoids.
The problem is computationally difficult (NP-hard); however, efficient heuristic algorithms converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions via an iterative refinement approach employed by both k-means and Gaussian mixture modeling. They both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the Gaussian mixture model allows clusters to have different shapes.
The unsupervised k-means algorithm has a loose relationship to the k-nearest neighbor classifier, a popular supervised machine learning technique for classification that is often confused with k-means due to the name. Applying the 1-nearest neighbor classifier to the cluster centers obtained by k-means classifies new data into the existing clusters. This is known as nearest centroid classifier or Rocchio algorithm.
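Lloyd's algorithm, the standard heuristic for k-means, is a short loop of assignment and update steps. This 1-D sketch uses fixed starting centroids so the run is deterministic (data and centroids invented for illustration):

```python
# Lloyd's algorithm for k-means on 1-D toy data.

def kmeans_1d(data, centroids, iters=20):
    clusters = []
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for x in data:
            idx = min(range(len(centroids)),
                      key=lambda i: (x - centroids[i]) ** 2)
            clusters[idx].append(x)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

data = [1.0, 1.2, 0.8, 9.8, 10.0, 10.2]
centroids, clusters = kmeans_1d(data, centroids=[0.0, 5.0])
print(centroids)   # converges close to [1.0, 10.0]
```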
→ Expectation–Maximization
In statistics, an expectation–maximization (EM) algorithm is an iterative method to find (local) maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter-estimates are then used to determine the distribution of the latent variables in the next E step.
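A toy EM run for a mixture of two 1-D Gaussians (data and starting values invented for illustration): the E step computes soft assignments, the M step re-estimates the parameters.

```python
# EM for a mixture of two 1-D Gaussians, sketched in pure Python.
from math import exp, sqrt, pi as PI

def gauss(x, mu, var):
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * PI * var)

def em_two_gaussians(data, mu=(1.0, 8.0), var=(1.0, 1.0), w=(0.5, 0.5),
                     iters=30):
    for _ in range(iters):
        # E step: responsibility of component 0 for each point.
        r0 = []
        for x in data:
            p0 = w[0] * gauss(x, mu[0], var[0])
            p1 = w[1] * gauss(x, mu[1], var[1])
            r0.append(p0 / (p0 + p1))
        # M step: re-estimate weights, means and variances.
        n0 = sum(r0)
        n1 = len(data) - n0
        w = (n0 / len(data), n1 / len(data))
        mu = (sum(r * x for r, x in zip(r0, data)) / n0,
              sum((1 - r) * x for r, x in zip(r0, data)) / n1)
        var = (sum(r * (x - mu[0]) ** 2 for r, x in zip(r0, data)) / n0,
               sum((1 - r) * (x - mu[1]) ** 2 for r, x in zip(r0, data)) / n1)
    return mu

data = [0.4, 0.6, 0.5, 9.4, 9.6, 9.5]
mu = em_two_gaussians(data)
print(mu)   # the two means end up near 0.5 and 9.5
```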
→ Gradient Descent
In mathematics, gradient descent (also often called steepest descent) is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. The idea is to take repeated steps in the opposite direction of the gradient (or approximate gradient) of the function at the current point, because this is the direction of steepest descent. Conversely, stepping in the direction of the gradient will lead to a local maximum of that function; the procedure is then known as gradient ascent.
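In code, gradient descent is just a loop. Here is a sketch on the toy function f(x) = (x - 3)^2, whose derivative is 2(x - 3):

```python
# Gradient descent on a simple differentiable function, f(x) = (x - 3)^2.
# Each step moves against the gradient.

def gradient_descent(start=0.0, lr=0.1, steps=100):
    x = start
    for _ in range(steps):
        grad = 2 * (x - 3)   # derivative of (x - 3)^2
        x -= lr * grad       # step in the direction of steepest descent
    return x

x = gradient_descent()
print(x)   # converges to the minimum at x = 3
```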
→ Monte Carlo Methods
Monte Carlo methods, or Monte Carlo experiments, are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. The underlying concept is to use randomness to solve problems that might be deterministic in principle. They are often used in physical and mathematical problems and are most useful when it is difficult or impossible to use other approaches. Monte Carlo methods are mainly used in three problem classes: optimization, numerical integration, and generating draws from a probability distribution.
In physics-related problems, Monte Carlo methods are useful for simulating systems with many coupled degrees of freedom, such as fluids, disordered materials, strongly coupled solids, and cellular structures. Other examples include modeling phenomena with significant uncertainty in inputs such as the calculation of risk in business and, in mathematics, evaluation of multidimensional definite integrals with complicated boundary conditions. In application to systems engineering problems (space, oil exploration, aircraft design, etc.), Monte Carlo–based predictions of failure, cost overruns and schedule overruns are routinely better than human intuition or alternative “soft” methods.
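The classic introductory Monte Carlo experiment estimates pi by random sampling: throw points into the unit square and count how many land inside the quarter circle.

```python
# Monte Carlo estimate of pi; seeded so the run is reproducible.
import random

def estimate_pi(samples=200_000, seed=42):
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1:   # point falls inside the quarter circle
            inside += 1
    return 4 * inside / samples

print(estimate_pi())   # close to 3.14159
```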
→ Simulated Annealing
Simulated annealing (SA) is a probabilistic technique for approximating the global optimum of a given function. Specifically, it is a metaheuristic to approximate global optimization in a large search space for an optimization problem. It is often used when the search space is discrete (for example the traveling salesman problem, the boolean satisfiability problem, protein structure prediction, and job-shop scheduling). For problems where finding an approximate global optimum is more important than finding a precise local optimum in a fixed amount of time, simulated annealing may be preferable to exact algorithms such as gradient descent or branch and bound.
The name of the algorithm comes from annealing in metallurgy, a technique involving heating and controlled cooling of a material to alter its physical properties, which depend on its thermodynamic free energy. Heating and cooling the material affects both its temperature and its thermodynamic free energy or Gibbs energy. Simulated annealing can be used for very hard computational optimization problems where exact algorithms fail; even though it usually only achieves an approximate solution to the global minimum, this can be enough for many practical problems.
The problems solved by SA are typically formulated as an objective function of many variables, subject to several constraints. In practice, the constraints can be penalized as part of the objective function.
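A toy simulated-annealing run on an invented 1-D objective with more than one minimum; the temperature schedule, step size and seed are arbitrary choices for illustration.

```python
# Simulated annealing on a 1-D objective with several local minima.
# High temperature lets the search accept uphill moves; cooling gradually
# freezes it near a good minimum.
import math
import random

def objective(x):
    # Invented function: a quadratic bowl with a sine ripple on top.
    return 0.2 * (x - 2) ** 2 + math.sin(3 * x)

def anneal(start=-2.0, temp=5.0, cooling=0.995, steps=3000, seed=7):
    rng = random.Random(seed)
    x, best = start, start
    for _ in range(steps):
        candidate = x + rng.uniform(-0.5, 0.5)
        delta = objective(candidate) - objective(x)
        # Accept improvements always; accept worse moves with probability
        # exp(-delta / temp), which shrinks as the temperature drops.
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            x = candidate
            if objective(x) < objective(best):
                best = x
        temp *= cooling
    return best

best = anneal()
print(best, objective(best))   # ends up in one of the low-lying minima
```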
→ Evolutionary Algorithms
In computational intelligence (CI), an evolutionary algorithm (EA) is a subset of evolutionary computation, a generic population-based metaheuristic optimization algorithm. An EA uses mechanisms inspired by biological evolution, such as reproduction, mutation, recombination, and selection. Candidate solutions to the optimization problem play the role of individuals in a population, and the fitness function determines the quality of the solutions (see also loss function). Evolution of the population then takes place after the repeated application of the above operators.
Evolutionary algorithms often perform well approximating solutions to all types of problems because they ideally do not make any assumptions about the underlying fitness landscape. Techniques from evolutionary algorithms applied to the modeling of biological evolution are generally limited to explorations of microevolutionary processes and planning models based upon cellular processes. In most real applications of EAs, computational complexity is a prohibiting factor; in fact, this computational complexity is mostly due to fitness function evaluation. Fitness approximation is one way to overcome this difficulty. However, seemingly simple EAs can often solve complex problems; therefore, there may be no direct link between algorithm complexity and problem complexity.
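One of the simplest evolutionary algorithms is the (1+1) EA, sketched here on the standard OneMax toy problem (maximise the number of 1-bits in a string). Mutation flips each bit with probability 1/n; selection keeps the child only if it is no worse.

```python
# A minimal (1+1) evolutionary algorithm on OneMax; seeded for
# reproducibility.
import random

def one_plus_one_ea(n=20, generations=2000, seed=1):
    rng = random.Random(seed)
    parent = [rng.randint(0, 1) for _ in range(n)]
    fitness = sum(parent)                      # OneMax fitness function
    for _ in range(generations):
        # Mutation: flip each bit independently with probability 1/n.
        child = [bit ^ (rng.random() < 1 / n) for bit in parent]
        if sum(child) >= fitness:              # selection: keep if no worse
            parent, fitness = child, sum(child)
    return fitness

print(one_plus_one_ea())   # reaches, or comes very close to, the optimum of 20
```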
Neural networks are on the same “level” as everything I have just listed, but much of what you talked about only applies to them. For example, some of these approaches require a decent understanding of how to model the underlying problem. Some are not even far away from literally modelling physics equations in software, but are used when we either don’t have a full understanding of the problem, or exact calculations would be too expensive. They often do not have the weird edge cases we see with NNs. They also mostly work with small amounts of training data, in exchange for an understanding of the problem space. Crucially, many of these approaches are very much explainable, and they may yield predictable outcomes once deployed in the field.
Neural networks tend to be used when we no longer have an understanding of how to solve the underlying problem, and their use indeed amounts to brute force. They do work very well in areas where the above approaches fail (especially image recognition, which caused their revival roughly 15 years ago; they had been discovered, and abandoned, before the AI Winter).
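As a purely structural illustration (the weights here are set by hand, not learned), this is about the smallest network that computes something no single linear unit can: XOR.

```python
# A tiny fixed-weight neural network computing XOR: two hidden units
# feeding one output unit. Weights chosen by hand for illustration.

def step(z):
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)     # hidden unit 1: behaves like OR
    h2 = step(x1 + x2 - 1.5)     # hidden unit 2: behaves like AND
    return step(h1 - h2 - 0.5)   # output: OR and not AND, i.e. XOR

print([xor_net(a, b) for a in (0, 1) for b in (0, 1)])   # → [0, 1, 1, 0]
```

In practice the weights are of course learned from data, for instance by gradient descent; this sketch only shows what the nested structure computes.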
In the history of artificial intelligence, an AI winter is a period of reduced funding and interest in artificial intelligence research. The term was coined by analogy to the idea of a nuclear winter. The field has experienced several hype cycles, followed by disappointment and criticism, followed by funding cuts, followed by renewed interest years or even decades later.
The term first appeared in 1984 as the topic of a public debate at the annual meeting of AAAI (then called the “American Association for Artificial Intelligence”). It is a chain reaction that begins with pessimism in the AI community, followed by pessimism in the press, followed by a severe cutback in funding, followed by the end of serious research. At the meeting, Roger Schank and Marvin Minsky – two leading AI researchers who had survived the “winter” of the 1970s – warned the business community that enthusiasm for AI had spiraled out of control in the 1980s and that disappointment would certainly follow. Three years later, the billion-dollar AI industry began to collapse.
Hype is common in many emerging technologies, such as the railway mania or the dot-com bubble. The AI winter was a result of such hype, due to over-inflated promises by developers, unnaturally high expectations from end-users, and extensive promotion in the media. Despite the rise and fall of AI’s reputation, it has continued to develop new and successful technologies. AI researcher Rodney Brooks would complain in 2002 that “there’s this stupid myth out there that AI has failed, but AI is around you every second of the day.” In 2005, Ray Kurzweil agreed: “Many observers still think that the AI winter was the end of the story and that nothing since has come of the AI field. Yet today many thousands of AI applications are deeply embedded in the infrastructure of every industry.”
Your overall assessment and conclusions, however, are in no way impeded by these subtleties. I think you did a great job of grasping the inherent limitations and problems, and especially the societal implications, which we should definitely not leave to the experts any more than we should have during the COVID pandemic. Our language provides the very fitting term “Fachidioten” (narrow specialists blind to everything outside their field) for me to insert here.
So, many of the technical limitations you discussed only apply to neural networks. An important point which you briefly came close to discussing is the fundamental limitation of the entire field, which I’m sure will interest you on a philosophical level: nothing in the field is capable of generating new causality information; these systems cannot learn why, only what. Sometimes we need to provide that, and sometimes it’s bypassed entirely, which amounts to nothing more than fancy statistics (!).
Basically, either we provide causality information and the algorithm figures out a way to work with that, or we bypass it and do pattern matching on observations. I know that you know enough to see the implied ethical problems! If this interests you, I can recommend this great book:
- The Book of Why: The New Science of Cause and Effect, Judea Pearl & Dana Mackenzie
The first chapter alone is worth a read.
Anyway, thank you for your podcast. We disagree on several issues, we’re closer on others, but we always meet at careful thinking and learning. Keep asking the next question!
Additional Producer Feedback
Yes, there’s a lot of hype around individuals. Also, you raise a correct point regarding the pros and cons of employment laws.
I’m now leaning more towards smaller government, which means fewer controls, which might mean less “humane” outcomes. But for me, the question is always: what will be the cost of the increased controls?
Regarding episode 131, he also said in an email:
You correctly identified that we are going towards a possibly dangerous future because of technology. I have always been a technical person and I enjoyed the hacker mentality.
You spoke about the caste system, and this reminded me of Cyberpunk literature, where you have savvy protagonists using their knowledge to gain an edge in life. The vision, of course, is usually dystopian, so if you are not like that, then you are part of the amorphous masses controlled by The Machine (not literally, but the government systems, the elite, …whatever). And people are only subjects (victims), but Cyberpunks actually understand (or try to) that there’s something behind this machine, moving parts that can and are being manipulated, each for their own goals!
Food for thought!
There’s more feedback from producers in my various inboxes, but I will keep that for the next episode or two.
If you have any thoughts on the things discussed in this or previous episodes, please join our forum and compare notes with other producers. You can also contact me in several other, more private ways.
If you are writing in from Russia, you might want to use my whistleblower contact form.
Toss a Coin to Your Podcaster
I am a freelance journalist and writer, volunteering my free time because I love digging into stories and because I love podcasting. If you want to help keep The Private Citizen on the air, consider becoming one of my Patreon supporters.
You can also support the show by sending money via PayPal, if you prefer.
This is entirely optional. This show operates under the value-for-value model, meaning I want you to give back only what you feel this show is worth to you. If that comes down to nothing, that’s OK with me. But if you help out, it’s more likely that I’ll be able to keep doing this indefinitely.
Thanks and Credits
I’d like to credit everyone who’s helped with any aspect of this production and thus became a part of the show. I am thankful to the following people, who have supported this episode through Patreon and PayPal and thus keep this show on the air:
Sir Galteran, Rhodane the Insane, Steve Hoos, Butterbeans, Michael Small, 1i11g, Jonathan M. Hethey, Michael Mullan-Jensen, Jaroslav Lichtblau, Dave, Sandman616, Jackie Plage, ikn, Bennett Piater, Rizele, Vlad, avis, Joe Poser, Dirk Dede, IndieGameiacs, Fadi Mansour, Kai Siers, David Potter, Cam, Mika, MrAmish, Robert Forster, Captain Egghead, krunkle, RJ Tracey, Rick Bragg, RikyM, astralc, Barry Williams, Jonathan, Superuser, D and Florian Pigorsch.
Many thanks to my Twitch subscribers: Mike_TheDane, jonathane4747, mtesauro, Galteran, l_terrestris_jim, pkeymer, BaconThePork, m0dese7en_is_unavailable, redeemerf and Stoopidenduser.
I am also thankful to Bytemark, who are providing the hosting for this episode’s audio file.
The show’s theme song is Acoustic Routes by Raúl Cabezalí. It is licensed via Jamendo Music. Other music and some sound effects are licensed via Epidemic Sound. This episode’s ending song is Fight to Win by Sven Karlsson.