A few days ago a prominent economist (hereafter PE) whom I follow on Twitter and enjoy bantering with mentioned, en passant, that PE had used ChatGPT to submit a referee report. (I couldn't find the tweet, which is why I am not revealing his identity here.) Perhaps PE did so in the context of my entertaining the idea of using it to write an exam, something I have been experimenting with. I can't say yet that it is really a labor-saving device, but it makes the process less lonely and more stimulating. The reason it is not a labor-saving device is that GPT is a bullshit artist, and you constantly have to be on guard that it's not just making crap up (which it will then freely admit).
I don't see how I can make it worth my time to get GPT to write referee reports for me, given the specific textual commentary I often have to make. I wondered -- since I didn't have the guts to ask PE to share the GPT-enabled referee report with me -- whether it would be easier to write such a joint report in the context of pointing out mistakes in the kind of toy models a lot of economics engages in. This got me thinking.
About a decade ago I was exposed to work by Allan Franklin and Kent Staley (see here) and (here). It may have been at a conference session before this work was published; I don't recall the exact order. But it was about the role of incredibly high statistical-significance standards in some parts of high-energy physics, and the evolution of these standards over time. What was neat about their research is that it was also sensitive to background pragmatic and sociological issues. And as I was reflecting on their narratives and evidence, it occurred to me that one could create a toy model to represent some of the main factors that should be able to predict whether an experimental paper gets accepted, and where failure of the model would indicate shifting standards (or something else interesting about them).
However, back home, I realized I couldn't do it. Every decision created anxiety, and I noticed that even simple functional relations were opportunities for agonizing internal debates. I reflected on the fact that while I had been writing about other people's models and methods for a long time (sometimes involving non-trivial math), I had never actually tried to put together a model myself, because all my knowledge about science was theoretical and self-taught. I had never taken graduate-level science courses, and so had never been drilled in the making of even basic toy models. This needn't have been the end of the matter, because all I needed to do was be patient and work through it by trial and error or, more efficiently, find a collaborator within (smiles sheepishly) the division of labor. But the one person I mentioned this project to over dinner didn't think it was an interesting modeling exercise because (i) my model would be super basic and (ii) it was not in his current research interests to really explore it. (He is a super sweet dude, so he said it without kicking down.)
Anyway, this morning, GPT and I 'worked together' to produce Model 7.0.
Model 7.0 is a toy model that represents the likelihood that a scientific paper is accepted or rejected in a field characterized as 'normal science', based on several factors. I wanted to capture intuitions about the kind of work it is, but also sociological and economic considerations that operate in the background. I decided the model needed to distinguish among four kinds of 'results':
- Ri: replications;
- Rii: results that require adjustment to the edges of the background theory;
- Riii: results that confirm difficult-to-produce predictions/implications of the background theory;
- Riv: results that refute central tenets of the background theory (so-called falsifications).
The model is shaped by the lack of interest in replications. Results that refute central tenets of a well-confirmed theory need to pass relatively high evidential thresholds (higher than those for Rii and Riii). I decided that experimental cost (which I would treat as a proxy for difficulty) would enter into decisions about the relative priority of Rii and Riii. In addition, standards differ between fields with robust background theories and fields whose background theories are less robust: as robustness goes up, the baseline standards of significance that need to be met go up, too.
I thought it clever to capture the Lakatosian intuition that results bearing on the core and results on the fringe are treated differently. And I also wanted to capture the pragmatic fact that journal space is not unlimited, and that its supply is shaped by the number of people chasing tenure.
So, in Model 7.0 we can distinguish among the following variables:
- Θ (Theta) represents the centrality of the topic.
- Σ (Sigma) represents the statistical significance of the results.
- Ρ (Rho) represents the robustness of the background theory.
- α (Alpha) represents journal space availability.
- N represents the number of people seeking tenure.
- C represents the cost of the experiments.
I decided that the relationship between journal space and the number of people seeking tenure would be structurally the same for all four kinds of results. As the number of people seeking tenure in a field (N) increases, journal space availability (α) also increases, leading to a decrease in the importance of the other three variables (Θ, Σ, Ρ) in determining the probability of acceptance, P(Accepted); and so this became 'α / (α + k * N)' in all the functions. Here k represents the rate of decrease in the significance threshold with an increase in the number of replications, which GPT thought clever to include. (Obviously, if you think that in a larger field it gets harder to publish results at a given threshold, you would change signs here, etc.)
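For concreteness, here is a minimal sketch of how that journal-space term behaves. The specific values of α and k below are made-up illustrations, not part of Model 7.0.

```python
# A minimal sketch of the shared journal-space term alpha / (alpha + k * N).
# The parameter values below are illustrative assumptions, not part of Model 7.0.

def space_term(alpha: float, k: float, n_tenure_seekers: int) -> float:
    """Journal-space factor shared by all four kinds of results."""
    return alpha / (alpha + k * n_tenure_seekers)

if __name__ == "__main__":
    alpha, k = 1.0, 0.05
    for n in (10, 50, 200):
        print(n, round(space_term(alpha, k, n), 3))
    # Prints 0.667, 0.286, 0.091: holding alpha fixed, the term shrinks as the
    # field grows; change the sign of k's effect if you think larger fields
    # make it harder to publish at a given threshold.
```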
The formulas for accepting papers for the four different kinds of results in a given field with a shared background theory are given as follows:
- P(Accepted | Ri) = f(Θ, Σ, Ρ, α, Ri) = f(Θ, Σ, Ρ, α / (α + k * N), Ri)
- P(Accepted | Rii) = f(Θ, Σ, Ρ, α, Rii) = f(Θ, Σ, Ρ, α / (α + k * N), Rii) * h(C, Rii)
- P(Accepted | Riii) = f(Θ, Σ, Ρ, α, Riii) = f(Θ, Σ, Ρ, α / (α + k * N), Riii) * h(C, Riii)
- P(Accepted | Riv) = f(Θ, Σ, Ρ, α, Riv) = f(Θ, Σ, Ρ, α / (α + k * N), Riv) * g(Σ, Ρ, Riv)
For Rii and Riii, the formula also includes an additional term, h(C, Rii) or h(C, Riii) respectively, which represents the effect of cost on the likelihood of acceptance. The function h is such that as costs go up, there is a greater likelihood that Riii will be accepted than Rii: we are more interested in a new particle than in a changed parameter that affects only some measurements. For Riv, the formula includes an additional term, g(Σ, Ρ, Riv), which represents the impact of the robustness of the scientific theory being tested and the quality of the experimental design on the likelihood of acceptance. The standard for Riv is much higher than for Rii and Riii, as it requires an order-of-magnitude higher level of statistical significance.
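To see the pieces together, here is a minimal Python sketch of Model 7.0. The logistic form of f, the particular shapes of h and g, and all the numerical weights are placeholder assumptions chosen for illustration -- only the overall structure (the shared space term, the cost term for Rii and Riii, and the extra hurdle for Riv) comes from the description above.

```python
import math

# A toy sketch of Model 7.0. The logistic form of f and the shapes of h and g
# are placeholder assumptions chosen for illustration; only the overall
# structure (shared space term, cost term for Rii/Riii, extra hurdle for Riv)
# comes from the description above.

def space_term(alpha, k, n):
    """Journal-space factor shared by all four kinds of results."""
    return alpha / (alpha + k * n)

def f(theta, sigma, rho, space, kind):
    """Baseline acceptance probability from centrality (theta), significance
    (sigma), robustness (rho), and journal space, with replications (Ri)
    discounted to reflect the lack of interest in them."""
    interest = {"Ri": 0.2, "Rii": 1.0, "Riii": 1.0, "Riv": 1.0}[kind]
    score = 1.5 * theta + 1.0 * sigma - 0.5 * rho  # assumed weights
    return interest * space / (1.0 + math.exp(-score))

def h(cost, kind):
    """Cost term for Rii and Riii: as cost rises, Riii (hard-to-produce
    confirmations) is favored over Rii (adjustments at the edges)."""
    if kind == "Rii":
        return 1.0 / (1.0 + cost)
    return 0.5 + 0.5 * cost / (1.0 + cost)

def g(sigma, rho, kind):
    """Extra hurdle for Riv: significance has to clear a bar that rises with
    the robustness of the background theory (roughly an order of magnitude
    above the Rii/Riii standard)."""
    hurdle = 10.0 * (1.0 + rho)
    return 1.0 / (1.0 + math.exp(-(sigma - hurdle)))

def p_accepted(kind, theta, sigma, rho, alpha, k, n, cost=0.0):
    base = f(theta, sigma, rho, space_term(alpha, k, n), kind)
    if kind in ("Rii", "Riii"):
        return base * h(cost, kind)
    if kind == "Riv":
        return base * g(sigma, rho, kind)
    return base  # Ri

if __name__ == "__main__":
    # Same paper-level inputs across the four kinds of results: the replication
    # (Ri) is heavily discounted, and the refutation (Riv) is effectively
    # rejected because sigma = 3.0 falls far short of g's hurdle.
    for kind in ("Ri", "Rii", "Riii", "Riv"):
        print(kind, round(p_accepted(kind, theta=0.8, sigma=3.0, rho=0.7,
                                     alpha=1.0, k=0.05, n=50, cost=2.0), 3))
```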
It is not especially complicated to extend this kind of model to capture the idea that journals of different prestige game relative acceptance rates, or that replications (Ri) with non-trivially higher statistical power do get accepted (for a while), but that's for future work. :)
I decided to pause at Model 7.0 because GPT seemed rather busy helping others; I was getting lots of "Maybe try me again in a little bit" responses. And I could see that merely adding terms was not going to be satisfying: I needed to do some thinking about the nature of this model and play around with it. Anyway, GPT also made some suggestions on how to collect data for such a model, how to operationalize the different variables, and how to refine it, but I have to run to a lunch meeting. And then do a literature survey to see what kinds of models are out there. It's probably too late for me to get into the modeling business, but I suspect GPT-like models will become lots of people's buddies.