How best to anonymise domestic solar PV power data?

jack · April 26, 2019, 7:06am

We’re talking with a company that owns lots of UK solar PV power data (some is 1-minute resolution, some is half-hourly). I’m really excited because they’re being very generous and they are considering publishing this PV power data; but only if we anonymise the data first. That is, we need to modify the data so that it’s not possible to identify which exact house each data stream comes from.

We’ve considered two approaches to this - neither approach is perfect so we’d love your thoughts:

Reduce the precision of the latitude and longitude
For example, we’ll remove the last few decimal places of the latitude and longitude, such that the resulting lat/lng is accurate to about 1 km. This is our preferred approach.
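As a rough illustration of what dropping decimal places does, the sketch below rounds a coordinate to two decimal places and estimates the size of the resulting cell. The function names and the 111.32 km-per-degree constant are assumptions for illustration, and the sample coordinate is only an example point in the UK. Note that a degree of longitude is shorter than a degree of latitude at UK latitudes, so the cell is not square.

```python
import math

KM_PER_DEGREE_LAT = 111.32  # approximate length of one degree of latitude


def reduce_precision(lat, lng, decimals=2):
    """Round a coordinate to the given number of decimal places."""
    return round(lat, decimals), round(lng, decimals)


def cell_size_km(lat, decimals=2):
    """Approximate (north-south, east-west) size in km of one rounding step."""
    step_deg = 10 ** -decimals
    north_south = step_deg * KM_PER_DEGREE_LAT
    # Lines of longitude converge towards the poles, so scale by cos(latitude).
    east_west = north_south * math.cos(math.radians(lat))
    return north_south, east_west


coarse = reduce_precision(53.38041, -1.49693)
size = cell_size_km(53.38)
```

At this latitude two decimal places gives a cell a little over 1 km north-south and noticeably less than that east-west.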

Spatially aggregate the data into 1km x 1km grid boxes
First, we’d convert the raw data from kW to ‘yield’. That is, we’ll convert to the proportion of maximum power output. Then, we’d divide the country up into 1km x 1km grid boxes. For each grid box, we’ll report the mean yield of all the PV systems within that grid box. A problem with this solution is that different PV systems have different characteristics (shading, tilt, angle, inverter power caps, etc.), so it might not be appropriate to just take the mean.
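A minimal sketch of the aggregation step for a single timestep might look like the following. It assumes each reading already carries projected coordinates in km (for example, easting and northing divided by 1000) and a known maximum output `capacity_kw`; those field names and the function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def grid_mean_yield(readings):
    """readings: iterable of (x_km, y_km, power_kw, capacity_kw) for one timestep.

    Returns {(box_x, box_y): mean yield} on a 1 km x 1 km grid.
    """
    yields_by_box = defaultdict(list)
    for x_km, y_km, power_kw, capacity_kw in readings:
        box = (int(x_km // 1), int(y_km // 1))  # floor to the 1 km grid box
        yields_by_box[box].append(power_kw / capacity_kw)  # kW -> yield
    return {box: mean(values) for box, values in yields_by_box.items()}


example = grid_mean_yield([
    (432.2, 387.9, 1.0, 4.0),
    (432.8, 387.1, 2.0, 4.0),
    (433.1, 387.5, 3.0, 3.0),
])
```

The plain mean is used only because it is the simplest choice; it does nothing about the differences in shading, tilt or inverter caps mentioned above.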

A problem with both approaches is that, in remote areas, some 1km x 1km areas might only contain a single house with PV.

A work-around for the spatial aggregation technique might be that, for any grid box containing fewer than, say, 10 houses with PV, we’d ‘merge’ that grid box with neighbouring grid boxes until the ‘merged’ box contains at least 10 houses with PV. That is, all the ‘merged’ grid boxes would be tied together, and would report the same value.
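One way this merging could be sketched is a greedy pass that repeatedly folds an under-populated group of boxes into its smallest neighbouring group. This is an assumed algorithm, not something settled in the thread, and `merge_sparse_boxes` is a hypothetical name. It only considers boxes present in `counts`, so empty boxes would need to be included with a count of 0 if merging is allowed to bridge them.

```python
def merge_sparse_boxes(counts, min_systems=10):
    """counts: {(x, y): number of PV systems in that grid box}.

    Returns (group_of, totals): the group each box belongs to, and the
    number of PV systems in each group.
    """
    group_of = {box: box for box in counts}   # every box starts as its own group
    members = {box: [box] for box in counts}
    totals = dict(counts)

    def neighbouring_groups(group):
        found = set()
        for x, y in members[group]:
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                other = group_of.get((x + dx, y + dy))
                if other is not None and other != group:
                    found.add(other)
        return found

    changed = True
    while changed:
        changed = False
        for group in sorted(totals):
            if totals[group] >= min_systems:
                continue
            adjacent = neighbouring_groups(group)
            if not adjacent:
                continue  # isolated group: it stays small and needs other handling
            target = min(adjacent, key=lambda g: (totals[g], g))
            for box in members.pop(group):
                group_of[box] = target
                members[target].append(box)
            totals[target] += totals.pop(group)
            changed = True
            break  # group structure changed, so restart the scan
    return group_of, totals


group_of, totals = merge_sparse_boxes({(0, 0): 3, (1, 0): 4, (2, 0): 12, (5, 5): 2})
```

Any group that is still below the threshold at the end (such as an isolated box) would have to be suppressed or handled separately.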

A work-around for the ‘reduced precision’ solution might be to further reduce the precision of the lat/lng in remote areas, so that the lat/lng refers to an area with multiple PV systems.
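A sketch of that idea, under the assumption that "reducing precision" means rounding to fewer decimal places: for each system, keep the finest rounding at which its cell still contains at least `min_systems` systems. The function name and the threshold are hypothetical.

```python
from collections import Counter


def adaptive_precision(locations, min_systems=10, max_decimals=2):
    """Return one coarsened (lat, lng) per system in locations.

    Each system keeps the finest rounding (up to max_decimals) at which its
    cell holds at least min_systems systems.
    """
    levels = range(max_decimals, -1, -1)
    counts = {
        d: Counter((round(lat, d), round(lng, d)) for lat, lng in locations)
        for d in levels
    }
    result = []
    for lat, lng in locations:
        for d in levels:
            cell = (round(lat, d), round(lng, d))
            if counts[d][cell] >= min_systems:
                break
        # If no level qualifies, cell is the coarsest (0 decimal places) cell.
        result.append(cell)
    return result


coarse = adaptive_precision(
    [(53.3804, -1.4969), (53.3811, -1.4985), (53.3796, -1.5012), (57.6, -4.4)],
    min_systems=3,
)
```

In this toy example the three nearby systems keep two decimal places, while the remote one falls back to whole degrees and is still alone in its cell, which shows that a fallback rule is needed for very sparse areas.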

Any thoughts / concerns? Any challenges we’ve missed?

edit: Changed from “fewer than 10 houses” to “fewer than 10 houses with PV”, after Jamie’s suggestion on twitter :0

jack · April 26, 2019, 7:19am

Some great discussion on twitter:

jack · April 26, 2019, 7:29am

Chris Briggs on twitter has suggested differential privacy. I should look into this more!

edit: See this 2014 paper: Geo-Indistinguishability: Differential Privacy for Location-Based Systems

danstowell · April 26, 2019, 7:59am

Your own work (on the correlation between cloud cover and power output) implies to me that any non-aggregated data could in principle be de-anonymised: if we look at the time-series of the data coming out, and we check the cloud cover pattern for any given day, (plus the PV we see from aerial images), we can infer who is who. We might even be able to get it simply from which time series lags behind which other. If anonymity is needed, IMHO aggregation is needed.

jack · April 26, 2019, 11:22am

Good points, @danstowell - TBH, I can’t think of any anonymisation technique that would allow us to completely guarantee anonymity, especially, as you say, if we have historical irradiance data & a map of PV panels. Hmm…

I don’t fully understand differential privacy yet but, AFAICT, it involves adding random noise to the lat/lng. But I don’t think that guarantees privacy either. Hmm…

The only way I can think of to completely guarantee anonymity is that we don’t release the PV power data. Instead we privately use the raw PV power data to train a machine learning model (e.g. a model which maps from satellite images to estimated irradiance); and then we release the parameters of that model. But that isn’t great from an open-science perspective. Ideally, we’d want everyone in the community to have access to the same raw data.

danstowell · April 26, 2019, 11:55am

Differential privacy is basically a cleverer way to fuzz data (it provides tighter guarantees than simply adding an arbitrary amount of jitter and hoping you’ve added enough). It’s not the main consideration, unless it were possible to apply differential privacy with all the possible side-information already included. Sounds like another research project to me…
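For a sense of what this "cleverer fuzz" can look like for locations, here is a minimal sketch of planar Laplace noise, the mechanism used in the geo-indistinguishability paper mentioned above. The function names, the `epsilon_per_km` parameter and the km-per-degree constant are assumptions for illustration. It perturbs the location only, so it does nothing about the time-series side-information raised earlier.

```python
import math
import random


def planar_laplace_offset(epsilon_per_km, rng=random):
    """Draw a random (east_km, north_km) offset from a planar Laplace distribution."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    # The radius of a planar Laplace follows Gamma(shape=2, scale=1/epsilon).
    radius_km = rng.gammavariate(2.0, 1.0 / epsilon_per_km)
    return radius_km * math.cos(angle), radius_km * math.sin(angle)


def perturb(lat, lng, epsilon_per_km, rng=random):
    """Shift a lat/lng by a planar Laplace offset (flat-earth approximation)."""
    east_km, north_km = planar_laplace_offset(epsilon_per_km, rng)
    km_per_deg_lat = 111.32  # approximate
    km_per_deg_lng = km_per_deg_lat * math.cos(math.radians(lat))
    return lat + north_km / km_per_deg_lat, lng + east_km / km_per_deg_lng


noisy_lat, noisy_lng = perturb(53.3804, -1.4969, epsilon_per_km=2.0)
```

The mean displacement is 2 / `epsilon_per_km`, so a value of 2.0 moves locations by about 1 km on average; smaller values give more privacy and larger displacements.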

I think gridded aggregation is the best thing to do. I’m sorry that it’s not the popular option! It’s possible to make some inferences by disaggregation (!) but those inferences are not very certain, so (IMHO, again) the privacy risk is very low.

yonadav · April 26, 2019, 1:38pm

Thank you for doing this! Couple points:

The ML-model wouldn’t actually be private unless also trained with DP because it could overfit to the training data.
The US Census is using differential privacy, so I’d imagine that this company could trust it as a legitimate technology.
It’s probably impossible to differentially-privately release this data if the insights you’re hoping to let others glean are not statistics across populations (mean generation in this tiny region at 5 PM), but rather statistics about a scenario so narrow that only a single individual satisfies it (mean generation at 5 PM for a house with these qualities [shading, tilt, …, approximate location], for which only a single match exists).
In general, this would be easier if rather than releasing raw data, you had in mind a set of summary statistics for each 10-person block that would be sufficiently useful for open-scientists to glean the necessary insights. Then you can apply DP straightforwardly to each of these statistics. (This is how the census does this.)
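As a sketch of what applying DP to one summary statistic could look like, the code below releases the mean yield of a block using the Laplace mechanism. It assumes yields are clipped to [0, 1], that the number of systems in the block is public, and that only a single statistic is released; the function names are hypothetical. Releasing a statistic at every timestep would consume privacy budget each time, which this sketch does not account for.

```python
import random


def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_mean_yield(yields, epsilon, rng=random):
    """Release the mean of yields in [0, 1] with epsilon-differential privacy."""
    n = len(yields)
    clipped = [min(max(y, 0.0), 1.0) for y in yields]
    # Changing one system's yield moves the mean by at most 1/n.
    sensitivity = 1.0 / n
    return sum(clipped) / n + laplace_noise(sensitivity / epsilon, rng)


noisy_mean = dp_mean_yield([0.42, 0.51, 0.38, 0.47, 0.55,
                            0.40, 0.49, 0.44, 0.52, 0.46], epsilon=1.0)
```

With 10 systems and epsilon of 1.0 the noise scale is 0.1, which is large relative to a yield between 0 and 1, so bigger blocks make the released mean much more useful.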

(Other more qualified people have volunteered, but of course feel free to dm me if I can assist further.)

dan · April 26, 2019, 2:12pm

Anonymising / anonymizing data is always hard.

The specific challenge we have here is that any anonymising of the PV locations is going to make the cloud alignment with the PV readings (target) very hard or incorrect as you start to dial up to highly granular modelling (spatial and temporal).

It would be a real pity to only be able to make compromised data available to the general population.

Here’s one possible idea: could you not do any aggregation or differential aggregation on the PV locations, but rather than passing a contiguous dataset over to the general public, slice the dataset up into spatial grids (say 100km square), and also potentially into chunks of time? You then distribute these “chunks” to the population of data scientists. Clearly there would be a little data that is effectively lost around the boundaries of the chunks, as you don’t have the adjacent points you would need for a full model, but within each chunk you have full detail.

You would need to make the chunks small enough such that reverse engineering wasn’t possible, but they are still hopefully valuable.
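A minimal sketch of the slicing step, assuming readings carry projected coordinates in km and naive `datetime` timestamps. The 7-day chunk length and the function names are assumptions; the post does not specify how long a time chunk should be or how the absolute position of each chunk would be hidden.

```python
from collections import defaultdict
from datetime import datetime

EPOCH = datetime(1970, 1, 1)


def chunk_key(easting_km, northing_km, timestamp, grid_km=100, days_per_chunk=7):
    """Identify the space-time chunk that a reading belongs to."""
    epoch_days = (timestamp - EPOCH).days
    return (
        int(easting_km // grid_km),
        int(northing_km // grid_km),
        epoch_days // days_per_chunk,
    )


def split_into_chunks(readings, grid_km=100, days_per_chunk=7):
    """readings: iterable of (easting_km, northing_km, timestamp, power_kw)."""
    chunks = defaultdict(list)
    for reading in readings:
        easting_km, northing_km, timestamp, _power_kw = reading
        key = chunk_key(easting_km, northing_km, timestamp, grid_km, days_per_chunk)
        chunks[key].append(reading)
    return dict(chunks)


chunks = split_into_chunks([
    (432.2, 387.9, datetime(2019, 4, 26, 12, 0), 1.2),
    (455.0, 310.4, datetime(2019, 4, 26, 13, 0), 0.8),
    (432.2, 387.9, datetime(2019, 5, 26, 12, 0), 1.5),
])
```

Here the first two readings land in the same chunk and the third, a month later, lands in a different one.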

TylerBusby333 · April 26, 2019, 2:46pm

I can understand why a company wants to anonymize data they’re releasing, and if there’s no way to get this data without it being anonymized, I’m for anonymizing it. However, I’m just not really understanding the threat vectors of releasing the dataset publicly. It seems like the fact that a home has solar panels is public knowledge, since they’re necessarily visible from some point outside of that person’s property if they’re producing power. And for most places, publicly viewable satellite data allows you to spot installations and geolocate them fairly easily.

Perhaps the worry is that being able to accurately map PV output to a specific person/location could be used against them somehow? Moreso than just the fact that they own a PV system? But I can’t really think of any way that would be used to attack someone. Although I realize just because I can’t think of anything doesn’t mean it’s not possible.

I mostly ask because it seems like any privacy conclusions we come to here would also inform our policy for mapping solar panels. And I wouldn’t want to cause any harm with that project.

jack · April 26, 2019, 4:32pm

On the topic of not causing harm (!)…

Ultimately, all we’re doing with the PV mapping project is taking data which is already public (satellite imagery) and transforming it. So, in a sense, we’re not revealing anything new. For example, today, if you want to know if your neighbour has a large PV array then you can just look at Google Maps.

But, to make sure we keep people happy, I think we might want to consider building a simple tool to make it really easy for people to remove their PV systems from OpenStreetMap. In all likelihood, very few people will decide to remove their PV system from OSM; but just having the option to do so should give some comfort to folks who might be a bit concerned.

On the question of why we need to anonymise the PV power data… I totally agree that PV power data isn’t especially personal data. I think the challenge is that this company’s customers signed an agreement years ago; and that agreement didn’t say that their data would be shared publicly. So we either anonymise the data; or we go back to every customer and ask them to opt-in to sharing their full data. I’m exploring whether we can do both (i.e. the default is we share anonymised data; but, if customers opt-in to sharing clean data, then we’ll share their clean data).

TylerBusby333 · April 28, 2019, 9:05pm

Makes sense! Thanks for the response, I’m glad we’re on the same page.

jack · April 29, 2019, 4:14pm

Assuming we have to reduce the precision of the latitude and longitude of each location, what’s an acceptable loss of precision? If the resulting lat/lng specifies an area of about a square km in size, is that OK?

For reference: Each pixel on the METEOSAT geostationary satellite image represents about 1 square km; and the Met Office numerical weather predictions are on a 2 km x 2 km grid. The next generation METEOSAT satellite (to be launched in a few years) will have resolution closer to 0.5 km; and the Met Office are likely to increase their spatial resolution again soon. (Polar-orbiting satellites provide much higher spatial resolution, but much lower temporal resolution.)

JamieTaylor · April 29, 2019, 4:31pm

One (albeit quite obscure and unlikely) example attack vector would be that identifying locations/addresses of customers of the PV monitoring company in question paves the way for social engineering attacks i.e. phishing by post - attacker impersonates the company and sends snail mail asking PV owner to log in to their fake portal and update bank details. Obviously anyone could do this with mass mailouts but targeted phishing attacks are less likely to raise alarms.

At Sheffield Solar (University of Sheffield) we have used most of the techniques mentioned here at some point or another when releasing PV data to third parties. Our preferred approach is always to spatially aggregate to the point we’re confident individual system data cannot be dis-aggregated. Where we publish individual system data we either apply sufficient jitter to the system locations to be confident that the true location cannot be identified or we ensure we have explicit permission from the system owner to publish their data (very difficult and time-consuming to obtain). We have in the past shared some data with fellow academics without anonymising locations (e.g. where the application depends on accurate geospatial data), but in these cases we’ve always insisted on having a Material Transfer Agreement in place (a bit like an NDA but for one-off data transfer).

Is there an option whereby the data is not anonymised but is published under license agreement between the recipient and the company who supply the data, requiring the recipient to supply their information and signature? As with all data licensing, any mis-use of the data would be difficult to police, but at least we’d have a record of anyone who has received the data and could perhaps satisfy the company that the data is protected by contract. Then again, it would be a lot of effort to administer and very undesirable from an open source perspective.

dan · May 1, 2019, 1:00pm

Having to blur the location or aggregate the PV sites is going to put a floor on the temporal / spatial resolution we can hope to effectively model. If we need to aggregate to x km squares, the resolution we can model to would be, say, n*x km. I don’t know what n will be, but I’m guessing around 2, or a low single digit. Probably need to test empirically to be sure… Similarly, the temporal resolution will be floored, but it’s harder to work this out exactly - obviously it’s related to cloud speeds and rate of cloud formation.

As you say, Jack, there will be other factors which floor the resolution too, like weather resolution (I think the inner domain from Met is 1.5 km). Obviously the goal, however, is to model with a resolution similar to the satellite image resolution, which means the anonymization could be the binding constraint now or soon…

In terms of getting around this - it should be possible for the anonymization level (i.e. in km) to be a function which varies over the UK. So in areas with many houses or PV sites, you could keep the resolution high, but in sparse areas you will have to shift / aggregate more. This should help a little.

dan · May 1, 2019, 1:59pm

Not sure about this - website says 1.5 km, but some emails say 2km…

jack · May 1, 2019, 2:12pm

all good points!

One other thought: Even if some of our PV power data is at coarse spatial resolution, I think we should be able to get full spatial res for the ~2,500 PV systems on PVOutput.org.

SimonTemple · May 29, 2019, 10:04pm

Do the people at https://maps.luftdaten.info/#13/53.3804/-1.4969 have any useful code for anonymising?

Ish · June 27, 2019, 8:06pm

I am new to Differential Privacy but would like to apply it to this problem. Could you suggest a publicly available PV power dataset similar to the one held by this organization? Thanks.

jack · July 1, 2019, 10:43am

Welcome to the forum, Ish & Simon!

Ish, I’m not aware of any easy-to-access public PV dataset. We’re working hard to get at least one PV dataset released ASAP.

In the meantime, you could try downloading data from the PVOutput.org API.

jvmancuso · September 23, 2019, 5:16pm

Hi all, unfortunately just now finding this page – was concerned I didn’t have much to contribute to this effort, but now that privacy-preserving ML methods are in the mix I feel I can hopefully be more helpful.

I would say that if the end goal of using this data is more than simply interactive data analysis, with emphasis on using the data for forecasting, then differential privacy is not the only approach that could be useful. More specifically, in the context of training machine learning models on this dataset, differential privacy would be better used in combination with complementary technologies from PPML.

Is privacy still a concern here? Would be happy to go into further detail, but don’t want to go down a rabbit hole if the need is no longer there.
