Dataset Catalog

Usage

When landing on the /catalog/ page you should expect to see the following, where clicking on BSR datasets will take you to a list of available datasets to download.

_images/dataset_catalog.png

Note

BSR” stands for Biosample Spectral Repository.

Note

Privileged permissions are required to add cataloged datasets, so don’t worry if you don’t see the +Add button.

Here’s an example list of available datasets.

_images/dataset_list.png

From there the datasets themselves can be downloaded as zip files by clicking on the link under the FILE column.

_images/dataset_list_download.png

Note

As a security measure the download URLs are only temporary. Attempting to use them once they have expired will not work and will result in a AccessDenied error. To generate new ones, please refresh the page.

Datasets can be filtered by name.

_images/dataset_filter.png

Clicking on the dataset NAME will then take you to the dataset details.

_images/dataset_view.png

Details are organized and some will be collapsed, when expanded expect something similar to the following.

_images/dataset_view_expanded.png

Fields

  • Query: The query used to produce a dataset is itself modeled and stored in the database - this field is the query’s name. To add new queries see SQL Explorer.

  • Name: This is the name given to the dataset.

  • Version: Datasets are versioned to distinguish between those generated by the same query at different times in the lifetime of the database. I.e., different snapshots as the database contents grows.

  • Description: A more verbose text describing the semantics and contents of this particular dataset.

  • File: This is the link to the zip file. Click to download.

  • SHA-256: The SHA-256 checksum of the entire zip file. See Computing Data File Integrity.

  • Size: The size of the downloadable zip file in MB (0MB implies a size of the order of KB).

  • SQL: This is the SQL query used to generate the dataset from the database.

  • Data SHA-256:: The SHA-256 checksum of the data file archived within the zip file. See Computing Data File Integrity.

  • App version: The version of the application deployed and thus used to generate the dataset.

  • ID: The primary key for the dataset as stored in the website’s main database.

  • N rows: The number of rows in the zipped data file. Depending on the query, this could be the total number of patients or something else.

  • N spectral data files: The number of individual spectral data files zipped within the downloadable zip file.

  • Spectral data filenames: A list of all the file names for all individual spectral data files zipped within the downloadable zip file.

Computing Data File Integrity

The SHA-256 field is the SHA-256 checksum of the zip file that once downloaded, its integrity can be verified by computing the checksum and comparing it to the value of this field - they should be identical.

Checksums can be computed on a unix machine (e.g., MacOSX or Linux) via the terminal using the following command:

shasum -a 256 path/to/downloaded/file.zip

Using a Windows CMD prompt:

certutil -hashfile C:path\to\downloaded\file.zip SHA256

Similarly the Data SHA-256 field is the checksum for the data table file archived within the zip file, e.g., “BSR.csv”.

Note

BSR” stands for Biosample Spectral Repository.