Safety and security of Python and R package import into TREs

Overview

Summary

TREs currently tend to allow access to PyPI and CRAN for less-sensitive data, but only specific approved packages for more sensitive data. Practice varies, however: some TREs allow CRAN access while others do not. Restricting packages is only a partial control, because anything a malicious Python/R package does could also be written by hand inside the environment. It is also hard to draw a clear line between R and Python files and AI/ML models.

On egress, the main challenge is that disclosure control is labour intensive, although some automated tools exist to help.

Next Steps

  • Collaborate on a shared allowlist/blocklist for packages

Raw Notes

  • Current TREs allow access to PyPI and CRAN for less-sensitive data but only specific packages for more sensitive data.

  • Experiences differ: some TREs have no access to CRAN, others do

    • Scottish safe haven - no CRAN access

    • Dundee & GM allow full CRAN access

  • CRAN has a fairly strict pipeline for adding packages, so can it be trusted?

    • but perhaps the checks only cover coding standards rather than pen testing, file system access etc.

  • If egress can be locked down sufficiently, does it matter?

    • also need to ensure things like file access, network access etc are prohibited

    • can this be done?

  • Is there a difference between R & Python files and a large AI/ML model? Not sure there’s a clear dividing line between things we allow and things we don’t

  • R has a system() function that allows executing arbitrary operating system commands

  • If there was a malicious Python/R package, you could just write the same code inside the environment, so preventing access to libraries makes it harder but not impossible to do bad things.
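As a rough illustration of the last few points, the sketch below statically scans Python source for imports of modules that give file system, process or network access. The module list and the function name `find_sensitive_imports` are assumptions made for this example, not something any TRE is known to use. It also shows why such checks only raise the bar: a dynamic import is not detected.

```python
import ast

# Hypothetical list of modules treated as sensitive; a real TRE would choose its own.
SENSITIVE_MODULES = {"os", "subprocess", "socket", "ctypes"}


def find_sensitive_imports(source: str) -> set[str]:
    """Return the sensitive top-level modules that the source imports statically."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute "from x import y" statements; relative imports are skipped.
            names = [node.module]
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            if top_level in SENSITIVE_MODULES:
                found.add(top_level)
    return found


example = "import subprocess\nfrom os import path\nimport json\n"
print(sorted(find_sensitive_imports(example)))  # ['os', 'subprocess']

# A dynamic import slips past this kind of static check entirely.
print(find_sensitive_imports('__import__("subprocess")'))  # set()
```

The second call returns an empty set even though the code loads `subprocess`, which matches the note above: blocking or scanning libraries makes misuse harder, not impossible.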

Egress

  • Disclosure control is labour intensive

  • Some talk of automated tools

  • Can prevent accidental disclosure

  • What about malicious attempts to extract data, e.g. encrypted, embedded in image files, in binary models etc.?

  • File size potentially helps

    • E.g. it is plausible to extract small amounts of patient data in an encrypted form that passes disclosure control, but unlikely you could do that with 1000s of records
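One way an automated egress check might combine the file size idea with a crude test for encrypted content is sketched below. It measures the Shannon entropy of the bytes, since encrypted or compressed data tends to sit close to 8 bits per byte. The thresholds and the names `shannon_entropy` and `egress_flags` are assumptions for this example, and the check would only support, not replace, manual disclosure control.

```python
import math
from collections import Counter

# Hypothetical thresholds; suitable values would depend on the TRE and file type.
MAX_BYTES = 1_000_000
ENTROPY_THRESHOLD = 7.5  # bits per byte, out of a maximum of 8


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte values, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def egress_flags(data: bytes) -> list[str]:
    """Return reasons why an egress file might deserve a closer manual look."""
    flags = []
    if len(data) > MAX_BYTES:
        flags.append("large")
    if shannon_entropy(data) > ENTROPY_THRESHOLD:
        flags.append("high-entropy")
    return flags


print(egress_flags(b"patient_id,age\n1,42\n"))  # []
print(egress_flags(bytes(range(256)) * 8))      # ['high-entropy']
```

This is easy to evade for small payloads: a short file cannot reach high entropy, and re-encoding encrypted data (for example as base64, which carries at most 6 bits per byte) lowers the measured value. That is consistent with the note above that small amounts of data could plausibly pass, while larger extractions are harder to hide.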

Roadmap plan

  • Is it possible to lock down a TRE sufficiently to allow unlimited ingress? If so, this is the best solution, as there is no friction for researchers. It also allows future ingress items such as LLMs / neural nets etc.

  • If not, can TREs collaborate to whitelist (and blacklist) packages, so that each one does not need to repeat the work?

    • Central register / co-ordination

    • But what to do about versioning?

  • Could have a dual model:

    1. Docker-based containerised TREs that are completely locked down, meaning that any ingress is allowed

    2. TREs with a list of allowed packages, which are the only ones that can be used, plus a process to request new packages
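The shared register and the versioning question above could be approached by approving specific versions rather than whole packages. The sketch below assumes a made-up allowlist and blocklist, and a made-up function `check_requirement`; it only accepts exactly pinned `name==version` lines. Names are normalised as described in PEP 503, so `Scikit_Learn` and `scikit-learn` match.

```python
import re

# Hypothetical shared register: normalised package name -> approved versions.
ALLOWLIST = {
    "numpy": {"1.26.4", "2.0.1"},
    "scikit-learn": {"1.5.0"},
}
# Hypothetical blocklist entry, for illustration only.
BLOCKLIST = {"example-malicious-package"}


def normalise(name: str) -> str:
    """Normalise a package name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def check_requirement(line: str) -> str:
    """Classify a requirement line of the form 'name==version'."""
    name, sep, version = line.strip().partition("==")
    version = version.strip()
    if not sep or not version:
        return "rejected: version must be pinned"
    name = normalise(name.strip())
    if name in BLOCKLIST:
        return "rejected: blocklisted"
    if version in ALLOWLIST.get(name, set()):
        return "allowed"
    if name in ALLOWLIST:
        return "needs review: version not approved"
    return "needs review: package not on allowlist"


for requirement in ["NumPy==1.26.4", "Scikit_Learn==1.4.0", "requests==2.32.0", "numpy"]:
    print(requirement, "->", check_requirement(requirement))
```

Routing unknown packages and unapproved versions to "needs review" rather than rejecting them outright mirrors the request process in option 2, and a register shared between TREs would mean each review is done once.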