Automated forecast benchmarking

From my experience in this field, there is a distinct lack of reproducibility and benchmarking capability of new method submissions for forecasting.

Typically, a new submission has standard data requirements, such as temperature, cloud cover, and other variables. These tend, predictably, to come from reanalyses (MERRA2, ERA5) or NWP.

Many of these methods lack reproducibility or trust, and therefore impact. There is simply not a robust enough method of testing. The standard way of benchmarking is to calculate the skill score and compare it to smart persistence (persistence of the clear-sky index, to avoid diurnal irradiance patterns). Whilst smart persistence is straightforward, the testing phase is usually carried out at a single location. There is no saying how well the method would do elsewhere, nor how it compares to all other methods.
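To make the comparison concrete, here is a minimal sketch of a smart persistence reference and an RMSE-based skill score. The function names, the use of GHI as the target variable, and the choice of RMSE as the error metric are assumptions for illustration; the clear-sky irradiance is taken as a given input rather than computed.

```python
import math


def smart_persistence(ghi, ghi_clear, horizon):
    """Forecast GHI by persisting the clear-sky index over `horizon` steps.

    Returns a list the same length as `ghi`. Entries are None where no
    forecast can be issued (the first `horizon` steps, or when the
    clear-sky irradiance at issue time is zero, i.e. at night).
    """
    forecast = [None] * len(ghi)
    for t in range(horizon, len(ghi)):
        origin = t - horizon
        if ghi_clear[origin] > 0:
            k = ghi[origin] / ghi_clear[origin]  # clear-sky index at issue time
            forecast[t] = k * ghi_clear[t]       # rescale by clear-sky GHI at valid time
    return forecast


def rmse(pairs):
    """Root mean square error over an iterable of (forecast, observed) pairs."""
    pairs = list(pairs)
    return math.sqrt(sum((f - o) ** 2 for f, o in pairs) / len(pairs))


def skill_score(model, reference, observed):
    """Return 1 - RMSE_model / RMSE_reference on the shared valid timestamps.

    Positive values mean the model beats the reference. Raises
    ZeroDivisionError if the reference is perfect (RMSE of zero).
    """
    valid = [i for i in range(len(observed))
             if model[i] is not None and reference[i] is not None]
    rmse_model = rmse((model[i], observed[i]) for i in valid)
    rmse_ref = rmse((reference[i], observed[i]) for i in valid)
    return 1.0 - rmse_model / rmse_ref


# Toy data for one day at one site (values are made up for illustration).
ghi_clear = [0, 100, 300, 500, 300, 100, 0]
ghi = [0, 80, 270, 300, 240, 90, 0]
reference = smart_persistence(ghi, ghi_clear, horizon=1)
```

Both forecasts are scored on the same timestamps, so a method cannot look better simply by skipping the hard cases. Repeating this over many sites is exactly the part that single-location studies leave out.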

Therefore, I think there is an interesting opportunity here that mainly requires computing skill rather than any science knowledge.
The desire is that, when a researcher believes they have developed the ideal solution, they can submit the code to the OCF benchmarking repository. Two things must be guaranteed at this point: (1) security, meaning that the code will not be made public or shared before publication, and (2) an entirely unbiased benchmarking evaluation.
Once uploaded, the OCF benchmarking system tests the code against a multitude of locations and interesting situations.
Then, it returns a set of metrics, and perhaps even standardised graphics in the preferred style.

Once published, the code is then freely available to all for download.

The obvious challenges are:

  1. having all the data required to perform the diverse mix of approaches.
  2. providing a standard code framework template and debugging so that the researcher can produce a workable method that will actually complete when submitted.
    2.1 I expect this would be some Python script that loads the required packages, shows how to access some or all of the variables, and then shows what return value should be provided, preceded by copious assertions to validate the actual output.
  3. Ensuring that our benchmark is up to scratch!
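For the template described in point 2.1, a rough sketch might look like the following. Everything here is hypothetical: the function names `my_forecast` and `check_forecast`, the input dictionary with a `"ghi"` key, and the convention of returning a dict keyed by valid time are assumptions, not an agreed OCF interface.

```python
import math
from datetime import datetime, timedelta


def my_forecast(inputs, issue_time, horizons):
    """Researcher's method goes here.

    Placeholder behaviour: persist the last observed GHI value for
    every requested horizon. `inputs` is assumed to be a dict of
    variable name -> list of past values.
    """
    last = inputs["ghi"][-1]
    return {issue_time + h: last for h in horizons}


def check_forecast(forecast, issue_time, horizons):
    """Assertions run on the returned value before it is scored."""
    assert isinstance(forecast, dict), "forecast must be a dict of {valid_time: value}"
    expected = {issue_time + h for h in horizons}
    assert set(forecast) == expected, "forecast must cover exactly the requested horizons"
    for valid_time, value in forecast.items():
        assert isinstance(value, (int, float)) and not isinstance(value, bool), \
            f"value at {valid_time} must be a number"
        assert math.isfinite(value), f"value at {valid_time} must be finite"
        assert value >= 0, f"irradiance at {valid_time} cannot be negative"
    return True


# Example run with made-up inputs.
issue_time = datetime(2020, 6, 1, 12, 0)
horizons = [timedelta(minutes=30), timedelta(minutes=60)]
forecast = my_forecast({"ghi": [400.0, 420.0]}, issue_time, horizons)
check_forecast(forecast, issue_time, horizons)
```

The idea is that a researcher can run the same checks locally, so a submission that passes them should at least complete when benchmarked.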

The hope would be that OCF would then be the go-to location for forecasting.
I have collaborations with probably the best in this space who could easily provide info/guide on 1 and 3, and perhaps even contribute directly.

There is a possibility that I can dedicate funds next year to making this happen from my new post. But nothing is confirmed yet.

Hi Jamie,

I really like this idea, and we should definitely talk more about it.

Making it easy to compare different forecasting approaches is something that’s close to my heart! Back when I was doing my postdoc on disaggregation, I thought a lot about how to run a disaggregation competition, and the open-source tool that I helped write, NILMTK, was all about trying to make NILM research more reproducible. And, just yesterday, I had a great conversation with Jeremy Freeman, who has done amazing work running ML competitions for biology.

Running other people’s code puts a lot of responsibility on OCF, unfortunately, and so probably isn’t practical unless we can fund a team dedicated to this task. Even if participants put their code into a container, there’s still far too much that can go wrong. When Jeremy first started running competitions, he asked people to submit their code, and he quickly found that it just wasn’t practical. And, back when I was thinking about a disaggregation competition, we decided early on that it wouldn’t be practical for people to upload their code. There are also IP issues.

That said, there are subtle ways round this problem. One might be that we provide, for example, a hosted Jupyter notebook which loads the data, runs a forecasting model, and then computes some metrics (all executed in the cloud). Participants would fork this notebook, and insert their models, and the notebook would compute the metrics. This would put the onus on the participant to debug their code.

Alternatively, we could do the standard ML competition thing of asking participants to run their models themselves (perhaps using some code we write which computes a suite of metrics), and upload their results.
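As a rough sketch of what that scoring code could look like: participants upload a CSV of forecasts, and a small script joins it against held-back ground truth and reports metrics per site. The CSV columns (`site`, `valid_time`, `forecast`), the function names, and the choice of MBE, MAE, and RMSE are all assumptions for illustration.

```python
import csv
import io
import math
from collections import defaultdict


def load_results(csv_text):
    """Parse uploaded rows into {(site, valid_time): forecast}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {(row["site"], row["valid_time"]): float(row["forecast"])
            for row in reader}


def score(results, truth):
    """Compute MBE, MAE and RMSE per site.

    `truth` maps (site, valid_time) -> observed value. Every ground-truth
    point must have a forecast, so a submission cannot skip hard cases.
    """
    errors = defaultdict(list)
    for key, observed in truth.items():
        if key not in results:
            raise ValueError(f"missing forecast for {key}")
        site = key[0]
        errors[site].append(results[key] - observed)

    metrics = {}
    for site, errs in errors.items():
        n = len(errs)
        metrics[site] = {
            "mbe": sum(errs) / n,                             # mean bias error
            "mae": sum(abs(e) for e in errs) / n,             # mean absolute error
            "rmse": math.sqrt(sum(e * e for e in errs) / n),  # root mean square error
        }
    return metrics


# Made-up example: two sites, three ground-truth points.
uploaded = "site,valid_time,forecast\nA,t1,110\nA,t2,190\nB,t1,50\n"
truth = {("A", "t1"): 100.0, ("A", "t2"): 200.0, ("B", "t1"): 50.0}
per_site = score(load_results(uploaded), truth)
```

Reporting per site rather than one pooled number keeps the multi-location aspect visible, which was one of the original motivations.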

Having said all this, I’m afraid it’s unlikely that we’ll be able to focus on benchmarking for the next, say, 12 months. We’ll probably need to focus entirely on PV nowcasting (because we might be funded by organisations who are expecting us to make progress on PV nowcasting!). But I’d love to help build a benchmarking system if we have the bandwidth.

And, my hope is that we can start by “simply” releasing lots of data and some scripts to compute standard metrics; and that may be sufficient to begin pulling the community towards a standard benchmarking approach.

I see. Yeah, ideally this could be something hosted that requires no further input from an employee/volunteer once launched. That side of things is where your expertise comes in.

The data requirements for nowcasting and these ML forecasting techniques are the same, so I think solving the nowcasting first just opens the door for this at a later date.

Great point - I totally agree :)

Awesome discussion, and hi all! Just echoing @jack’s comments, a key learning for us from doing ML benchmarking in biology has been that producing standardized outputs locally (with tools we provide) and uploading results is a much lower barrier to entry, and simpler to maintain. We ran a challenge this way called “neurofinder”. Code for the site / system is open-source and would be quite easy to fork and modify :)

Other thoughts

  • The hardest part in all challenges for me has been getting shared agreement on definition of ground truth and metrics, something to think about early
  • A single prize / deadline can incentivize submissions, but participation often drops off afterwards, compared to a rolling challenge that builds and tracks community progress over time
  • Metrics around model complexity / interpretability would be cool to consider
  • Some folks involved in COCO have relayed to me the value of linking a challenge to some kind of regular in-person meeting
  • If we do want to go the cloud Jupyter route, the pangeo folks would surely have ideas

This is really interesting.

  • The hardest part in all challenges for me has been getting shared agreement on definition of ground truth and metrics, something to think about early

I think that this is more manageable in solar applications. There are many publications on the topic.

I am not sure starting with a prize is the right way to go. I think publishing the approach in the relevant journals as a short communication would be a good place to start. Having the testing scenario approved by the key players in the field is also very interesting. Fortunately, the relevant names and places have already been mapped out:
The author of this Journal, Dazhi Yang, has expressed keen interest in being involved in this discussion. Therefore, I believe all the expertise, impact and impetus can be harnessed.
Once established in a few papers, and perhaps if OCF has more market or sponsorship, there could be a more appropriate time in the future to do a competition. Intriguing.
