Authentication¶
Accessing LEAP Cloud Buckets¶
If you are not working on the LEAP JupyterHub (e.g. on a local machine or HPC), accessing LEAP cloud data requires authentication. We manage access to our Google Cloud Storage (GCS) buckets directly, on a per-user basis.
- Ensure your computer can access the Google Cloud SDK from its terminal. If you do not have `gcloud` on your machine, please consult the Install Instructions for guidance. Log in to whichever Google account you want LEAP to grant access to with the `gcloud auth login` command. To make sure everything worked, verify that `gcloud auth list` returns the correct account.
- Reach out to the Data and Compute team to request access. We'll need to know which account email you have logged into gcloud with, as well as which bucket you need to access. Permissions will be granted based on what kind of access is needed (read-only, write, etc.). Once confirmed on our end, verify which permissions were granted under your account email with `gcloud storage buckets get-iam-policy gs://<bucket-name>`.
- Once you have access, you have a variety of options for actually moving the data; most tools or libraries have some way of syncing with gcloud. You may find it helpful to generate Application Default Credentials with the `gcloud auth application-default login` command. This helps applications automatically find credentials at runtime, bypassing the need for manual authentication. It creates an OAuth 2.0 token file in a location like `$HOME/.config/gcloud/application_default_credentials.json`. A short Python sketch for checking that these credentials are discoverable is shown after this list.
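To sanity-check the setup from Python, a minimal sketch (assuming the `google-auth` and `gcsfs` packages are installed; the bucket name below is a placeholder for whichever bucket you were granted access to) might look like:

```python
# Minimal sketch: confirm Application Default Credentials are discoverable.
# Assumes the `google-auth` and `gcsfs` packages are installed; the bucket
# name is a placeholder.
import google.auth
import gcsfs

# Picks up the credentials written by `gcloud auth application-default login`.
credentials, project = google.auth.default()
print("Found default credentials; project:", project)

# gcsfs can reuse the same default credentials to browse the bucket.
fs = gcsfs.GCSFileSystem(token="google_default")
print(fs.ls("<bucket-name>"))
```

If this lists the bucket contents without prompting for a browser login, your credentials are in place.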
For most use cases, rclone gets the job done.
Configuring Rclone with Google Cloud¶
- Install rclone with `conda install rclone`.
- Run `rclone config` and follow the prompts for the Google Cloud Storage setup. Generally the default options are fine when in doubt, but this part can be tricky! Feel free to reach out with questions if unsure. Some guidelines:
  - Since IAM access was granted directly, you can leave the "Project number" blank.
  - We do not use anonymous access; that would defeat the whole purpose of granting access! Put `false` when prompted.
  - `object_acl` specifies the default permissions for any new objects you upload to storage. We recommend choosing either 5 or 6. `bucket_acl` does not matter, since you're unlikely to create any new cloud buckets.
  - If you followed the steps above and generated Application Default Credentials, you can choose `true` (which is NOT the default) for the `env_auth` option, which tells rclone to pick up GCP IAM credentials from the runtime environment as needed. If your `gcloud auth login` session is valid, it will be used to authenticate rclone without much hassle! This is the best option, especially if you are on an HPC without a web browser.
- Upon completion, you will have created or added to your rclone configuration file. To find its location, run `rclone config file` (it is usually something like `$HOME/.config/rclone/rclone.conf`). If you see the remote that was just set up, you can now use rclone freely! See our technical reference on rclone for guidance. A sketch of what the resulting config entry might look like is shown after this list.
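For orientation, a completed Google Cloud Storage remote entry in `rclone.conf` might look roughly like the sketch below. The remote name `leap-gcs` and the ACL values are illustrative placeholders; use whatever you selected during `rclone config`.

```
[leap-gcs]
type = google cloud storage
object_acl = bucketOwnerFullControl
bucket_acl = private
env_auth = true
```

With the remote in place, a command like `rclone ls leap-gcs:<bucket-name>` should list the bucket contents using your gcloud credentials.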
OSN Pod Authentication¶
Step-by-step instructions
- Reach out to the Data and Compute team. They can share bucket credentials for the `leap-pangeo-inbox` bucket.
- How exactly you achieve the upload will depend on your preference. Some common options include:
  - Load / aggregate your NetCDF files using xarray and save the output to Zarr with `.to_zarr(...)` (see the example below)
  - Use fsspec or rclone to move an existing Zarr store to the target bucket (a short fsspec sketch follows the example below)
Whatever method you use, the final upload must contain one or more valid Zarr stores. These are required for compatibility with LEAP's cloud-optimized workflows. A typical workflow for the above steps might look like:
```python
import xarray as xr
import zarr
from dask.distributed import Client
from obstore.store import S3Store
from zarr.storage import ObjectStore

# Optional: start a dask distributed client for parallel processing.
client = Client()
print(client.scheduler.address)

ds = xr.tutorial.open_dataset("air_temperature", chunks={})
ds_processed = ds.mean(...).resample(...)  # sample modifications to the data

# Define our credentials, bucket name, and dataset path
DATASET_NAME = "<INSERT_YOUR_DATASET_NAME_HERE>"
osnstore = S3Store(
    "leap-pangeo-inbox",
    prefix=f"{DATASET_NAME}/{DATASET_NAME}.zarr",
    aws_endpoint="https://nyu1.osn.mghpcc.org",
    access_key_id="<ASK LEAP DCT MEMBERS FOR CREDENTIALS>",
    secret_access_key="<ASK LEAP DCT MEMBERS FOR CREDENTIALS>",
    client_options={"allow_http": True},
)
zstore = ObjectStore(osnstore)

# Write our dataset as Zarr to OSN
ds_processed.to_zarr(
    zstore,
    zarr_format=3,
    consolidated=False,
    mode="w",
)

# Note: your data can be read anywhere, by anyone!
roundtrip = xr.open_zarr(
    f"https://nyu1.osn.mghpcc.org/leap-pangeo-inbox/{DATASET_NAME}/{DATASET_NAME}.zarr",
    consolidated=False,
)
```
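If you already have a valid Zarr store on disk, the fsspec route mentioned above is a simple alternative. The sketch below assumes the `s3fs` package is installed; the local path and dataset name are placeholders, and the credentials are the same ones shared by the Data and Compute team.

```python
# Minimal sketch of the fsspec route, assuming `s3fs` is installed.
# The local path and dataset name below are placeholders.
import s3fs

fs = s3fs.S3FileSystem(
    key="<ASK LEAP DCT MEMBERS FOR CREDENTIALS>",
    secret="<ASK LEAP DCT MEMBERS FOR CREDENTIALS>",
    client_kwargs={"endpoint_url": "https://nyu1.osn.mghpcc.org"},
)

# Recursively copy an existing local Zarr store into the inbox bucket.
fs.put(
    "local_dataset.zarr",
    "leap-pangeo-inbox/<DATASET_NAME>/<DATASET_NAME>.zarr",
    recursive=True,
)
```

rclone works just as well here if you configure an S3-type remote pointing at the same endpoint with the shared credentials.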