Safety and security of Python and R package import into TREs#

Overview#

Summary#

TREs currently tend to allow access to PyPI and CRAN for less-sensitive data but only specific packages for more sensitive data. Approaches vary, however: some TREs have CRAN access while others do not. Package controls also have limits: even if a malicious Python/R package is blocked, a user could still write the same code inside the environment. It is challenging to establish the line between R & Python files and AI/ML models.

Regarding egress, disclosure control is labour intensive, although some automated tools exist.

Next Steps#

Collaborate on a shared allowlist/blocklist for packages

Raw Notes#

Current TREs allow access to PyPI and CRAN for less-sensitive data but only specific packages for more sensitive data.
Different people have different experiences. Some have no access to CRAN; others do
- Scottish safe haven - no CRAN access
- Dundee & GM allow full CRAN access
CRAN has a fairly strict pipeline for adding packages, so can it be trusted?
- but perhaps just coding standards rather than pen testing, file system access etc.
If egress can be locked down sufficiently, does it matter?
- also need to ensure things like file access, network access etc are prohibited
- can this be done?
Is there a difference between R & Python files and a large AI/ML model? Not sure there’s a clear dividing line between the things we allow and the things we don’t
R has a system() function that allows executing arbitrary shell commands
If there was a malicious Python/R package, you could just write the same code inside the environment, so preventing access to libraries makes it harder but not impossible to do bad things.

Egress#

Disclosure control is labour intensive
Some talk of automated tools
Can prevent accidental disclosure
What about malicious attempts to extract data, e.g. encrypted, embedded in image files, in binary models etc.?
File size potentially helps
- E.g. it is plausible to extract small amounts of patient data in an encrypted form that passes disclosure control, but unlikely that you could do that with 1000s of records
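
As a rough illustration of the file size point, an automated pre-check could flag egress files that are large or that look like encrypted or compressed data (byte entropy close to 8 bits per byte). This is a minimal sketch, not a tool discussed in the session: the names screen_egress, MAX_BYTES and ENTROPY_THRESHOLD and both threshold values are assumptions, and a small encrypted payload hidden inside an otherwise ordinary file would not be caught.

```python
import math
from collections import Counter

MAX_BYTES = 1_000_000      # assumed size limit; a real TRE would set its own
ENTROPY_THRESHOLD = 7.5    # assumed cut-off in bits per byte (maximum is 8)


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of the byte values, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def screen_egress(data: bytes) -> list[str]:
    """Return reasons to send a file for manual review (empty if none)."""
    flags = []
    if len(data) > MAX_BYTES:
        flags.append("too-large")
    if shannon_entropy(data) > ENTROPY_THRESHOLD:
        flags.append("high-entropy")
    return flags


# A small aggregate table in plain text raises no flags.
table = b"age_band,count\n40-49,12\n50-59,17\n" * 10
table_flags = screen_egress(table)

# Uniformly distributed bytes stand in for encrypted content here.
uniform = bytes(range(256)) * 64
uniform_flags = screen_egress(uniform)
```

Compressed archives and many binary model formats also have high entropy, so a check like this can only route files to a human reviewer; it does not replace disclosure control.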

Roadmap plan#

Is it possible to lock down a TRE sufficiently to allow unlimited ingress? If so, this is the best solution, as there is no friction for researchers. It also allows future ingress items such as LLMs, neural nets etc.
If not, can TREs collaborate to allowlist (and blocklist) packages, so that each one does not need to repeat the work?
- Central register / co-ordination
- But what to do about versioning?
Could have a dual model:
1. Docker-based containerised TREs that are completely locked down, meaning that any ingress is allowed
2. TREs with a list of allowed packages that you must use, plus a process to request new packages
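
One way to picture the second model, including the versioning question, is a register keyed by package name with a set of reviewed versions per package. The sketch below is hypothetical: PackageRegister, its fields, the status strings and the example entries are assumptions made for illustration, not an existing shared register or a recommendation of particular packages.

```python
from dataclasses import dataclass, field


@dataclass
class PackageRegister:
    """Hypothetical shared register of reviewed packages."""
    # Package name (lower case) -> versions that have been reviewed and approved.
    allowed: dict[str, set[str]] = field(default_factory=dict)
    # Package names that are rejected at every version.
    blocked: set[str] = field(default_factory=set)

    def check(self, name: str, version: str) -> str:
        key = name.lower()
        if key in self.blocked:
            return "blocked"
        if version in self.allowed.get(key, set()):
            return "allowed"
        # Unknown package, or a known package at an unreviewed version:
        # this is where the "request new packages" process would start.
        return "needs-review"


# Illustrative entries only.
register = PackageRegister(
    allowed={"numpy": {"1.26.4"}, "pandas": {"2.2.2"}},
    blocked={"examplebadpkg"},
)

requests = [("numpy", "1.26.4"), ("numpy", "2.0.0"), ("ExampleBadPkg", "0.1")]
results = {f"{name}=={version}": register.check(name, version)
           for name, version in requests}
```

Pinning exact versions means every new release needs a fresh review, which is the cost the versioning question points at; approving version ranges instead would reduce that work at the price of trusting unreviewed releases.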