# Sight unseen: how far can we go with keeping data hidden from users? ## Overview ### Summary This is the model of [OpenSAFELY](https://www.opensafely.org/). Questions explored were how to ensure that the provided metadata is sufficient, how to extend the approach to more complex data (highly relational/linked databases) and the implied need of code review before running on actual data. In summary this can be done but there are limitations. ## Raw notes - What are the advantages and disadvantages of hiding data from users? - How do we minimise barriers and frustration when working with unseen data? - Pros and cons of hiding data. Is it even worth it? - Challenge with interpretting the question - is this about restricting just identifiable information? - In what scenario would it be beneficial to keep data hidden? - Federated analytics - [OpenSAFELY](https://www.opensafely.org) model. Allows you to see data that is structured the same as the original but filled with random (synthesised?) data. - Can we provide sufficient metadata to allow for unclean or missing data? - Additional challenge with more complex data (highly relational/linked databases) - There is a need for code review before running on the original data - Who's resposibility is it to create the metadata and do the cleaning? The data provider? The TRE (probably not)? - On the question of how far we can take this: - It can be possible, but there are limitations. Including reducing the chance of the results. - Pros of hiding data: - increase trust in research - potential for higher quality research (no p-hacking, more hypothesis testing, less data mining, etc) - There are some doubts about the value/need for this. Aren't TREs with anonymised data enough? ### Roadmap plan #### Questions - What would a solution to this problem look like? - What resources would be needed (people, time, funds, infrastructure etc.)? - How can this community support you in getting them? - What working groups/orgs are already working on this, if any? How can we collaborate with them effectively? #### Notes - Something along the lines of the OpenSAFELY model could work - Requires trust in the data providers and researchers - Limitations of types of data and types of analyses - Resources required: people to do the code review step