Rubberducking with Claude

R, bioinformatics, ai, productivity
Author: Erik Lundevall-Zara

Published: August 13, 2025

Recently, I had the task of adding a new plot of genomic data from a research project. The code is written in R, and the plot uses ggplot2 for the actual plotting.

The last time I looked at the code was about 6 months ago, so it was not entirely fresh in my mind. The plot to be added should show the relative distribution of data within a certain type of genomic data (eukaryotic photoautotrophs), grouped by location and time period. There was already a plot that showed the relative distribution of eukaryotic photoautotrophs as part of the whole eukaryote community, so most of the work had already been done.

Starting the work

The relative distribution (referred to as relative abundance) is displayed as a percentage. The percentage of eukaryotic photoautotrophs in relation to all of the eukaryotes was about 2%. Thus, I wanted to take the data for this 2% subset and, within that subset, show the relative distribution, so that the total of the subset would amount to 100%.

My first thought was to simply sum all the abundance values for the eukaryotic photoautotrophs and divide each value by that sum. This is quite simple to do in R. The result was not good at all, because I had not considered the grouping of the data by location and time period. So instead of a single sum, I would have to calculate a sum of abundance for each combination of location and time period, and divide the abundance values within each of those groups by the corresponding group sum.
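To make the difference concrete, here is a small sketch with made-up numbers (not the real project data), using base R's ave() instead of dplyr to keep it dependency-free:

```r
# Toy data: two locations, one time period (made-up numbers for illustration)
df <- data.frame(
  location    = c("A", "A", "B", "B"),
  time_period = c("T1", "T1", "T1", "T1"),
  abundance   = c(1, 3, 2, 2)
)

# Naive version: a single sum over the whole dataset
naive <- df$abundance / sum(df$abundance)

# Grouped version: one sum per location/time period combination
grouped <- ave(df$abundance, df$location, df$time_period,
               FUN = function(x) x / sum(x))

# The naive values for location A sum to only 0.5, not 1;
# the grouped values sum to 1 within each group.
```

With the naive version, each group's values add up to that group's share of the whole dataset rather than to 100%.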

This is still pretty easy to do in R. I decided, though, to have a dialogue with Claude, just to see whether I would make any more mistakes and whether it would catch them. I described what I wanted to accomplish and narrowed it down to the essential data transformation needed before plotting the data. It essentially boils down to something like this:

plot_data <- previous_plot_data |>
  dplyr::group_by(location, time_period) |>                  # one group per location/time period combination
  dplyr::mutate(abundance = abundance / sum(abundance)) |>   # normalise abundance within each group
  dplyr::ungroup()                                           # drop the grouping, keep the data structure

The original plot data is represented as a dataframe, which is a bit like an Excel sheet of data: it contains multiple named columns and many rows of data. R has built-in support for working with such datasets; there is no need for explicit loops to iterate over each value, so this code could operate on massive amounts of data. This is not an R tutorial, but essentially the code takes the original data and adds grouping properties based on location and time period. The mutate step then operates on each group of the data, calculating the sum of the abundance values, dividing each value by that sum, and storing the result back in the same column. Finally, it removes the grouping properties, returning the data to its original structure.
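A quick way to convince yourself that the transformation behaves as described is to run it on a tiny made-up dataframe (assuming dplyr is installed, as in the original project) and check that each group now totals 100%:

```r
library(dplyr)

# Tiny stand-in for previous_plot_data (made-up numbers for illustration)
previous_plot_data <- data.frame(
  location    = c("north", "north", "south", "south"),
  time_period = c("spring", "spring", "spring", "spring"),
  abundance   = c(0.5, 1.5, 4, 6)
)

plot_data <- previous_plot_data |>
  dplyr::group_by(location, time_period) |>
  dplyr::mutate(abundance = abundance / sum(abundance)) |>
  dplyr::ungroup()

# Every location/time period group should now sum to 1 (i.e. 100%)
group_totals <- tapply(plot_data$abundance,
                       list(plot_data$location, plot_data$time_period),
                       sum)
```

The group totals all come out as 1, confirming that the normalisation happens within each group rather than across the whole dataset.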

Rubberducking

The result looked quite good, but I was not sure whether I had missed something. The original data processing also filtered the data to only include entries that were larger than 0.1% of the data; would I miss something with this change? I expressed my concern to Claude and gave it more of the code that processed the source data, up to the point where the data was ready for plotting.

Claude happily said that it saw a problem and proposed a solution, which included filtering away more records from the dataset. My concern was not that there was too much data, but that I might have excluded some data prematurely.

I explained to Claude that I thought the proposed solution was all wrong and was doing the complete opposite of what I wanted.

The response was that it was so sorry, and that it would come up with a new set of code, which it did. This code was essentially the original change that I had made, and I told it so.

The response was that I was right, and it outlined a few points on why I was right and was doing things in the right order.

This helped my own thought process for the problem and confirmed that my approach did give the expected result.

Final thoughts

This was a small and short exercise with Claude, but it highlighted a few points for me about rubberducking with an LLM:

  • The idea of explaining a problem to a rubber duck is that the process of explaining will help your own thinking towards a solution. LLMs can be that rubber duck, if you treat them like that and not as something that will solve your problem for you.
  • Natural language is imprecise. Regardless of whether it is a human, LLM, or rubber duck, your communication may not be as clear as you think it is.
  • You need to understand the domain and context. If I had not understood the potential issue, I might just have gone with the faulty solution proposed by Claude.
  • Rephrased problems and solutions can help your thinking. Seeing Claude’s faulty solution and why it considered it good, and the summary of why my solution actually worked, helped me organize my own thinking about the problem.

I may continue to use Claude as a digital rubber duck, and I think it works well for helping me get started on solving a problem.

But I do not trust it to solve any more significant problems by itself, if nothing else because I do not trust myself to express them precisely enough in natural language. Also, my own thoughts about the problem itself may not yet be clear enough.

(Picture by Anthony Dalesandro: https://www.pexels.com/sv-se/foto/livfullt-gummianka-splash-evenemang-i-chicago-33402837/)