Resource Manager

With ResourceManager, VASCA provides a utility that helps manage the raw input data. The data volumes processed by VASCA are generally large, and use cases as well as computation and storage resources can vary. ResourceManager adds an abstraction layer that is flexible enough to adapt to varying contexts while exposing a consistent API to the rest of VASCA’s pipeline functions.

As an example, the GALEX data for the proof-of-principle study was processed by downloading raw data from MAST to a local directory and running the pipeline on a regular office laptop. This directory was then cloud-synced via DESY’s NextCloud service so that multiple users could work collaboratively on the same dataset.

Another use case is the unit testing for this package, as well as this tutorial, which should both work in a development environment and in the GitHub continuous integration workflows.

Configuration

Configuration files are used to specify file locations, environment variables and even specific data products that are relevant for processing a specific instrument’s raw data. Users can freely edit them to include data location items as the use case requires. ResourceManager has the necessary consistency checks to warn about any misconfiguration. So try it out!

.env

Text file located at the root directory of the package. The resource manager reads it at initialization and uses dotenv to set the environment variables temporarily at run time. See .env_template when using VASCA for the first time.
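As a rough illustration of what setting variables "temporarily" means, the sketch below parses `KEY=VALUE` lines and exports them to the current process only. It is a standard-library stand-in, not the dotenv implementation and not VASCA's own code; the helper name `load_env_text` and the placeholder path are hypothetical, while the variable name `VASCA_DEFAULT` is taken from the example output further down.

```python
import os


def load_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines and export them to this process's environment."""
    parsed = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=VALUE line
        # Drop surrounding whitespace and optional quotes around the value
        parsed[key.strip()] = value.strip().strip("'\"")
    # Changes to os.environ affect only the running process and its children
    os.environ.update(parsed)
    return parsed


# Hypothetical .env content; the path is a placeholder
example = """
# Root directory for the default VASCA storage
VASCA_DEFAULT=/path/to/vasca
"""
loaded = load_env_text(example)
print(loaded)
```

Once the Python process exits, the shell environment is unchanged, which is why the file has to be read at every initialization.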

resource_envs.yml

Configuration file specifying the required environment variables and their associated attributes, such as a name, a project name and a short description that helps other users understand what each variable is used for.

resource_catalog.yml

Configuration file that associates directory or file items to specific environment variables. Each item has a name, description, type, and path attribute.

Note

The YAML configuration files are stored under the vasca module in a subdirectory named resource_metadata.
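To make the relationship between the two YAML files concrete, here is a minimal sketch of a consistency check over already-parsed configuration. The dictionaries mirror the structure that the parsed metadata has in the example below; the function `check_config` and its rules are assumptions for illustration, and ResourceManager's actual checks may differ.

```python
# Attributes every catalog item is expected to carry (per the description above)
REQUIRED_ITEM_KEYS = {"name", "description", "type", "path"}

# Hypothetical, shortened stand-ins for the parsed YAML files
envs = {
    "VASCA_DEFAULT": {
        "storage": "vasca",
        "project": "vasca",
        "description": "Default environment variable",
    },
}
catalog = {
    "vasca": {
        0: {
            "name": "test_resources",
            "description": "Data used for VASCA development",
            "type": "directory",
            "path": "/vasca/test/resources",
        },
    },
    "lustre": {
        0: {
            "name": "gal_ds_fields",
            "description": "Collection of GALEX drift scan dataproducts",
            "type": "directory",
            "path": "/GALEX_DS_GCK_fields",
        },
    },
}


def check_config(envs: dict, catalog: dict) -> list[str]:
    """Return a list of human-readable configuration problems."""
    problems = []
    # Storage systems that have an environment variable pointing at them
    known_storages = {attrs["storage"] for attrs in envs.values()}
    for storage, items in catalog.items():
        if storage not in known_storages:
            problems.append(f"storage '{storage}' has no environment variable")
        for idx, item in items.items():
            missing = REQUIRED_ITEM_KEYS - item.keys()
            if missing:
                problems.append(f"{storage}:{idx} is missing {sorted(missing)}")
    return problems


problems = check_config(envs, catalog)
print(problems)
```

Here the `lustre` storage is flagged because the shortened `envs` dictionary defines no variable for it.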

Example

Initialize the ResourceManager and see what metadata it parsed from the config files.

from pprint import pprint
from vasca.resource_manager import ResourceManager

rm = ResourceManager()
# Resource item catalog
pprint(rm.metadata["catalog"], sort_dicts=False)
{'sas_cloud': {0: {'name': 'gal_visits_list',
                   'description': 'Complete list of all GALEX visits with NUV '
                                  'exposure.',
                   'type': 'file',
                   'path': '/GALEX_visits_list/GALEX_visits_list_qualvars.fits'},
               1: {'name': 'gal_fields',
                   'description': 'Collection of GALEX dataproducts for fileds '
                                  'of interest.',
                   'type': 'directory',
                   'path': '/GALEX_fields'},
               2: {'name': 'gal_gphoton',
                   'description': 'Collection of GALEX gphoton runs.',
                   'type': 'directory',
                   'path': '/GALEX_gPhoton'},
               3: {'name': 'gal_visits_list_qualsel',
                   'description': 'Complete list of all GALEX visits.',
                   'type': 'file',
                   'path': '/GALEX_visits_list/GALEX_visits_list_qualsel.fits'}},
 'lustre': {0: {'name': 'gal_ds_visits_list',
                'description': 'Complete list of all GALEX drift-scan visits.',
                'type': 'file',
                'path': '/GALEX_DS_GCK_visits_list/GALEX_DS_GCK_visits_list.fits'},
            1: {'name': 'gal_ds_fields',
                'description': 'Collection of GALEX drift scan dataproducts',
                'type': 'directory',
                'path': '/GALEX_DS_GCK_fields'},
            2: {'name': 'gal_gphoton',
                'description': 'Collection of GALEX gphoton runs.',
                'type': 'directory',
                'path': '/GALEX_gPhoton'}},
 'vasca': {0: {'name': 'test_resources',
               'description': 'Data used for VASCA development',
               'type': 'directory',
               'path': '/vasca/test/resources'},
           1: {'name': 'gal_visits_list',
               'description': 'Complete list of all GALEX visits with NUV '
                              'exposure.',
               'type': 'file',
               'path': '/vasca/test/resources/GALEX_visits_list.fits'},
           2: {'name': 'docs_resources',
               'description': 'Data used for VASCA documentation',
               'type': 'directory',
               'path': '/docs/tutorial_resources'}}}
# Resource environment variables
pprint(rm.metadata["envs"], sort_dicts=False)
{'VASCA_DEFAULT': {'storage': 'vasca',
                   'project': 'vasca',
                   'description': 'Default environment variable for data '
                                  'management in the VASCA package',
                   'set': True,
                   'path': '/home/runner/work/vasca-mirror/vasca-mirror'},
 'UC_SAS_VASCARCAT': {'storage': 'sas_cloud',
                      'project': 'vascarcat',
                      'description': 'Resources for the UV variability catalog '
                                     'project on DESY Sync & Share. Remote '
                                     'directory: '
                                     'ULTRASAT-data/uc_science/vascarcat',
                      'set': True,
                      'path': '/dev/null'},
 'UC_LUSTRE_VASCARCAT': {'storage': 'lustre',
                         'project': 'vascarcat',
                         'description': 'Resoruces for the UV variability '
                                        'catalog project on DESY LUSTRE. Not '
                                        'used at the moment.',
                         'set': True,
                         'path': '/dev/null'}}

The main functionality is retrieving paths to specific resource items from the ResourceManager:

rpath = rm.get_path(resource="gal_visits_list", storage="vasca")
print(rpath)
/home/runner/work/vasca-mirror/vasca-mirror/vasca/test/resources/GALEX_visits_list.fits

All paths returned by get_path are verified:

from pathlib import Path

Path(rpath).exists()
True

Otherwise, actionable error messages are raised.
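The output above suggests that the returned path is the storage root from the environment variable joined with the item's catalog path. The following sketch reproduces that idea under stated assumptions: `resolve_path`, the `DEMO_ROOT` variable, and the choice of `FileNotFoundError` for a missing path are all hypothetical and not taken from VASCA's source.

```python
import os
import tempfile
from pathlib import Path


def resolve_path(envs: dict, catalog: dict, resource: str, storage: str) -> str:
    """Join the storage root (from an environment variable) with the item's path."""
    # Find the environment variable associated with the storage system
    env_name = next(n for n, attrs in envs.items() if attrs["storage"] == storage)
    root = os.environ[env_name]
    # Find the catalog item by name within that storage
    item = next(it for it in catalog[storage].values() if it["name"] == resource)
    # Catalog paths start with "/", so strip it before joining to the root
    full = Path(root) / item["path"].lstrip("/")
    # Verify the path before handing it out
    if not full.exists():
        raise FileNotFoundError(f"Resource '{resource}' not found at {full}")
    return str(full)


# Demo with a temporary directory standing in for a storage system
with tempfile.TemporaryDirectory() as tmp:
    os.environ["DEMO_ROOT"] = tmp
    target = Path(tmp) / "data" / "visits.fits"
    target.parent.mkdir(parents=True)
    target.touch()

    envs = {"DEMO_ROOT": {"storage": "demo"}}
    catalog = {
        "demo": {0: {"name": "visits", "type": "file", "path": "/data/visits.fits"}}
    }
    resolved = resolve_path(envs, catalog, "visits", "demo")
    resolved_ok = Path(resolved).exists()

print(resolved_ok)
```

Keeping the root in an environment variable and only the relative location in the catalog is what lets the same resource name resolve on a laptop, a cloud-synced directory, or a CI runner.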

Resource name not found:

rm.get_path(resource="foo", storage="vasca")
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[7], line 1
----> 1 rm.get_path(resource="foo", storage="vasca")

File /opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/vasca/resource_manager.py:323, in ResourceManager.get_path(self, resource, storage)
    321 # validate resource name
    322 if resource not in resource_list:
--> 323     raise KeyError(
    324         str.format(
    325             "Unknown resource '{}'. Select one from {}.",
    326             resource,
    327             resource_list_verbose,
    328         )
    329     )
    331 # get resource metadata
    332 success = False

KeyError: "Unknown resource 'foo'. Select one from ['gal_visits_list(sas_cloud:0)', 'gal_fields(sas_cloud:1)', 'gal_gphoton(sas_cloud:2)', 'gal_visits_list_qualsel(sas_cloud:3)', 'gal_ds_visits_list(lustre:0)', 'gal_ds_fields(lustre:1)', 'gal_gphoton(lustre:2)', 'test_resources(vasca:0)', 'gal_visits_list(vasca:1)', 'docs_resources(vasca:2)']."

Storage system not recognized:

rm.get_path(resource="gal_visits_list", storage="foo")
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[8], line 1
----> 1 rm.get_path(resource="gal_visits_list", storage="foo")

File /opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/vasca/resource_manager.py:300, in ResourceManager.get_path(self, resource, storage)
    298 # validate storage name
    299 if storage not in self.metadata["catalog"]:
--> 300     raise KeyError(
    301         str.format(
    302             ("Unknown storage system '{}'. Select one from {}."),
    303             storage,
    304             [strg for strg in list(self.metadata["catalog"].keys())],
    305         )
    306     )
    307 # list of all known resources: [<resource name>]
    308 resource_list = [
    309     self.metadata["catalog"][strg][id]["name"]
    310     for strg in self.metadata["catalog"]
    311     for id in self.metadata["catalog"][strg]
    312 ]

KeyError: "Unknown storage system 'foo'. Select one from ['sas_cloud', 'lustre', 'vasca']."
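Because both failures surface as `KeyError`, calling code can catch them in one place. The sketch below wraps the call in a small helper; `get_path_or_none` and `StubManager` are hypothetical names, and the stub only imitates the `get_path(resource=..., storage=...)` call shown above so that the example runs without VASCA installed.

```python
def get_path_or_none(manager, resource: str, storage: str):
    """Return the resolved path, or None if the lookup raises KeyError."""
    try:
        return manager.get_path(resource=resource, storage=storage)
    except KeyError as err:
        # The message lists the valid choices, so surface it to the user
        print(f"Lookup failed: {err}")
        return None


class StubManager:
    """Stand-in for ResourceManager; not VASCA's implementation."""

    def get_path(self, resource: str, storage: str) -> str:
        if storage != "vasca":
            raise KeyError(f"Unknown storage system '{storage}'.")
        return f"/stub/{resource}"


missing = get_path_or_none(StubManager(), "gal_visits_list", "foo")
found = get_path_or_none(StubManager(), "gal_visits_list", "vasca")
print(missing, found)
```

In a pipeline it is often preferable to let the `KeyError` propagate, since the message already names the valid resources and storage systems; a wrapper like this is mainly useful when a missing resource is an expected, recoverable case.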