lamindb.Collection

class lamindb.Collection(artifacts: list[Artifact], name: str, description: str | None = None, meta: Any | None = None, reference: str | None = None, reference_type: str | None = None, run: Run | None = None, revises: Collection | None = None)

Bases: Record, IsVersioned, TracksRun, TracksUpdates

Collections of artifacts.

Collections provide a simple way of versioning collections of artifacts.

Parameters:
  • artifacts (list[Artifact]) – A list of artifacts.

  • name (str) – A name.

  • description (str | None, default: None) – A description.

  • revises (Collection | None, default: None) – An old version of the collection.

  • run (Run | None, default: None) – The run that creates the collection.

  • meta (Artifact | None, default: None) – An artifact that defines metadata for the collection.

  • reference (str | None, default: None) – For instance, an external ID or a URL.

  • reference_type (str | None, default: None) – For instance, "url".

See also

Artifact

Examples

Create a collection from a list of Artifact objects:

>>> collection = ln.Collection([artifact1, artifact2], name="My collection")

Create a collection that groups a data & a metadata artifact (e.g., here RxRx: cell imaging):

>>> collection = ln.Collection(data_artifact, name="My collection", meta=metadata_artifact)

Attributes

property data_artifact: Artifact | None

Access to a single data artifact.

If the collection has a single data & metadata artifact, this allows access via:

collection.data_artifact  # first & only element of collection.artifacts
collection.meta_artifact  # metadata

property ordered_artifacts: QuerySet

Ordered QuerySet of .artifacts.

Accessing the many-to-many field collection.artifacts directly returns the artifacts in a non-deterministic order.

The property .ordered_artifacts lets you iterate over the artifacts in the order in which they were created.

property stem_uid: str

Universal id characterizing the version family.

The full uid of a record is obtained via concatenating the stem uid and version information:

stem_uid = random_base62(n_char)  # a random base62 sequence of length 12 (transform) or 16 (artifact, collection)
version_uid = "0000"  # an auto-incrementing 4-digit base62 number
uid = f"{stem_uid}{version_uid}"  # concatenate the stem_uid & version_uid
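As a rough illustration of the scheme above, the following sketch builds a uid from a random stem and a version suffix. The helpers `random_base62` and `make_uid` and the constant `BASE62` are written for this example only; they are not lamindb's own implementation, and the alphabet order is an assumption.

```python
import secrets
import string

# Assumed base62 alphabet for this sketch: digits, then upper- and lowercase letters.
BASE62 = string.digits + string.ascii_uppercase + string.ascii_lowercase


def random_base62(n_char: int) -> str:
    """Return a random base62 string of length n_char (illustrative helper)."""
    return "".join(secrets.choice(BASE62) for _ in range(n_char))


def make_uid(stem_uid: str, version_uid: str = "0000") -> str:
    """Concatenate the stem uid and the 4-character version suffix."""
    return f"{stem_uid}{version_uid}"


stem_uid = random_base62(16)  # 16 characters for artifacts and collections
uid = make_uid(stem_uid)
print(len(uid))  # 20 characters in total
```

All records of one version family share the same stem, so they differ only in the last four characters of the uid.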

property versions: QuerySet

Lists all records of the same version family.

>>> new_artifact = ln.Artifact(df2, revises=artifact)
>>> new_artifact.save()
>>> new_artifact.versions.df()

Simple fields

uid: str

Universal id, valid across DB instances.

name: str

Name or title of collection (required).

description: str | None

A description.

hash: str | None

Hash of collection content. 86 base64 characters can store 64 bytes (512 bits).
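The size arithmetic can be checked with a short sketch. The choice of SHA-512 and of the URL-safe base64 alphabet is an assumption made for illustration; the text above only states the lengths, not which hash function is used.

```python
import base64
import hashlib

# A SHA-512 digest is 64 bytes, i.e. 512 bits.
digest = hashlib.sha512(b"example content").digest()

# 64 bytes encode to 88 base64 characters, the last 2 of which are "=" padding.
encoded = base64.urlsafe_b64encode(digest).decode().rstrip("=")

print(len(digest), len(encoded))  # 64 86
```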

reference: str | None

A reference like URL or external ID.

reference_type: str | None

Type of reference, e.g., cellxgene Census collection_id.

meta_artifact: Artifact | None

An artifact that stores metadata that indexes a collection.

It has a 1:1 correspondence with an artifact. If needed, you can access the collection from the artifact via a private field: artifact._meta_of_collection.

visibility: int

Visibility of collection record in queries & searches (1 default, 0 hidden, -1 trash).

version: str | None

Version (default None).

Defines version of a family of records characterized by the same stem_uid.

Consider using semantic versioning with Python versioning.

is_latest: bool

Boolean flag that indicates whether a record is the latest in its version family.

created_at: datetime

Time of creation of record.

updated_at: datetime

Time of last update to record.

Relational fields

created_by: User

Creator of record.

transform: Transform | None

Transform whose run created the collection.

run: Run | None

Run that created the collection.

ulabels: ULabel

ULabels sampled in the collection (see Feature).

input_of_runs: Run

Runs that use this collection as an input.

artifacts: Artifact

Artifacts in collection.

Class methods

classmethod df(include=None, join='inner', limit=100)

Convert to pd.DataFrame.

By default, shows all direct fields, except updated_at.

Use parameter include to include other fields.

Parameters:
  • include (str | list[str] | None, default: None) – Related fields to include as columns. Takes strings of form "labels__name", "cell_types__name", etc. or a list of such strings.

  • join (str, default: 'inner') – The join parameter of pandas.

  • limit (int, default: 100) – Maximum number of rows to display from a Pandas DataFrame. Defaults to 100 to reduce database load.

Return type:

DataFrame

Examples

>>> labels = [ln.ULabel(name=f"Label {i}") for i in range(3)]
>>> ln.save(labels)
>>> ln.ULabel.filter().df(include=["created_by__name"])

classmethod filter(*queries, **expressions)

Query records.

Parameters:
  • queries – One or multiple Q objects.

  • expressions – Fields and values passed as Django query expressions.

Return type:

QuerySet

Returns:

A QuerySet.
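Django query expressions are keyword arguments of the form field__lookup=value. The toy matcher below only illustrates that naming convention on plain dictionaries; it is not how Django or lamindb evaluate queries (those are translated to SQL), and it does not handle lookups that span relations such as created_by__name. The names `LOOKUPS` and `matches` are hypothetical.

```python
from typing import Any, Callable

# A few lookup suffixes, modeled as plain Python predicates.
LOOKUPS: dict[str, Callable[[Any, Any], bool]] = {
    "exact": lambda value, arg: value == arg,
    "startswith": lambda value, arg: str(value).startswith(arg),
    "contains": lambda value, arg: arg in str(value),
    "gt": lambda value, arg: value > arg,
}


def matches(record: dict[str, Any], **expressions: Any) -> bool:
    """Return True if the record satisfies all field__lookup=value expressions."""
    for key, arg in expressions.items():
        field, _, lookup = key.partition("__")
        lookup = lookup or "exact"  # a bare field name means an exact match
        if not LOOKUPS[lookup](record[field], arg):
            return False
    return True


records = [{"name": "my label"}, {"name": "other label"}]
selected = [r for r in records if matches(r, name__startswith="my")]
print(selected)  # [{'name': 'my label'}]
```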

Examples

>>> ln.ULabel(name="my label").save()
>>> ln.ULabel.filter(name__startswith="my").df()

classmethod get(idlike=None, **expressions)

Get a single record.

Parameters:
  • idlike (int | str | None, default: None) – Either a uid stub, uid or an integer id.

  • expressions – Fields and values passed as Django query expressions.

Return type:

Record

Returns:

A record.

Raises:

lamindb.core.exceptions.DoesNotExist – In case no matching record is found.

Examples

>>> ulabel = ln.ULabel.get("FvtpPJLJ")
>>> ulabel = ln.ULabel.get(name="my-label")

classmethod lookup(field=None, return_field=None)

Return an auto-complete object for a field.

Parameters:
  • field (str | DeferredAttribute | None, default: None) – The field to look up the values for. Defaults to first string field.

  • return_field (str | DeferredAttribute | None, default: None) – The field to return. If None, returns the whole record.

Return type:

NamedTuple

Returns:

A NamedTuple of lookup information of the field values with a dictionary converter.
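To make the idea of a NamedTuple with a dictionary converter concrete, here is a minimal sketch on plain dictionaries. The helpers `to_identifier` and `build_lookup`, the normalization rule, and the sample records are assumptions for this example, not lamindb's implementation; values are assumed to start with a letter and to yield distinct identifiers.

```python
import re
from collections import namedtuple


def to_identifier(value: str) -> str:
    """Lowercase a value and replace runs of non-word characters by underscores."""
    return re.sub(r"\W+", "_", value.lower()).strip("_")


def build_lookup(records: list[dict], field: str):
    """Return a namedtuple for attribute access and a dict keyed by the raw values."""
    names = [to_identifier(r[field]) for r in records]
    Lookup = namedtuple("Lookup", names)
    by_value = {r[field]: r for r in records}
    return Lookup(*records), by_value


# Made-up records; the ids carry no meaning.
records = [{"name": "ADGB-DT", "id": 1}, {"name": "my label", "id": 2}]
lookup, lookup_dict = build_lookup(records, field="name")

print(lookup.adgb_dt["id"])          # 1
print(lookup_dict["ADGB-DT"]["id"])  # 1
```

Attribute access needs valid Python identifiers, which is why "ADGB-DT" becomes adgb_dt, while the dictionary keeps the original spelling as its key.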

See also

search()

Examples

>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> bt.Gene.from_source(symbol="ADGB-DT").save()
>>> lookup = bt.Gene.lookup()
>>> lookup.adgb_dt
>>> lookup_dict = lookup.dict()
>>> lookup_dict['ADGB-DT']
>>> lookup_by_ensembl_id = bt.Gene.lookup(field="ensembl_gene_id")
>>> lookup_by_ensembl_id.ensg00000002745
>>> lookup_return_symbols = bt.Gene.lookup(field="ensembl_gene_id", return_field="symbol")

classmethod search(string, *, field=None, limit=20, case_sensitive=False)

Search.

Parameters:
  • string (str) – The input string to match against the field ontology values.

  • field (str | DeferredAttribute | None, default: None) – The field or fields to search. Search all string fields by default.

  • limit (int | None, default: 20) – Maximum number of top results to return.

  • case_sensitive (bool, default: False) – Whether the match is case sensitive.

Return type:

QuerySet

Returns:

A QuerySet of search results, sorted by match score.

See also

filter() lookup()

Examples

>>> ulabels = ln.ULabel.from_values(["ULabel1", "ULabel2", "ULabel3"], field="name")
>>> ln.save(ulabels)
>>> ln.ULabel.search("ULabel2")

classmethod using(instance)

Use a non-default LaminDB instance.

Parameters:

instance (str | None) – An instance identifier of form “account_handle/instance_name”.

Return type:

QuerySet

Examples

>>> ln.ULabel.using("account_handle/instance_name").search("ULabel7", field="name")
            uid    score
name
ULabel7  g7Hk9b2v  100.0
ULabel5  t4Jm6s0q   75.0
ULabel6  r2Xw8p1z   75.0

Methods

async adelete(using=None, keep_parents=False)

append(artifact, run=None)

Add an artifact to the collection.

Creates a new version of the collection.

Parameters:
  • artifact (Artifact) – An artifact to add to the collection.

  • run (Run | None, default: None) – The run that creates the new version of the collection.

Return type:

Collection

Added in version 0.76.14.

async arefresh_from_db(using=None, fields=None, from_queryset=None)

async asave(*args, force_insert=False, force_update=False, using=None, update_fields=None)

cache(is_run_input=None)

Download cloud artifacts in collection to local cache.

Follows syncing logic: only caches outdated artifacts.

Returns paths to locally cached on-disk artifacts.

Parameters:

is_run_input (bool | None, default: None) – Whether to track this collection as run input.

Return type:

list[UPath]

clean()

Hook for doing any extra model-wide validation after clean() has been called on every field by self.clean_fields. Any ValidationError raised by this method will not be associated with a particular field; it will have a special-case association with the field defined by NON_FIELD_ERRORS.

clean_fields(exclude=None)

Clean all fields and raise a ValidationError containing a dict of all validation errors if any occur.

date_error_message(lookup_type, field_name, unique_for)

delete(permanent=None)

Delete collection.

Parameters:

permanent (bool | None, default: None) – Whether to permanently delete the collection record (skips trash).

Return type:

None

Examples

For any Collection object collection, call:

>>> collection.delete()

describe(print_types=False)

Describe relations of record.

Examples

>>> artifact.describe()

get_constraints()

get_deferred_fields()

Return a set containing names of deferred fields for this instance.

load(join='outer', is_run_input=None, **kwargs)

Stage and load to memory.

Returns in-memory representation if possible such as a concatenated DataFrame or AnnData object.

Return type:

Any

mapped(layers_keys=None, obs_keys=None, obsm_keys=None, obs_filter=None, join='inner', encode_labels=True, unknown_label=None, cache_categories=True, parallel=False, dtype=None, stream=False, is_run_input=None)

Return a map-style dataset.

Returns a pytorch map-style dataset by virtually concatenating AnnData arrays.

If your AnnData collection is in the cloud, move it into a local cache first via cache().

__getitem__ of the MappedCollection object takes a single integer index and returns a dictionary with the observation data sample for this index from the AnnData objects in the collection. The dictionary has keys for layers_keys (.X is in "X"), obs_keys, obsm_keys (under f"obsm_{key}") and also "_store_idx" for the index of the AnnData object containing this observation sample.
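The indexing behavior described above can be sketched without AnnData or PyTorch. The class `VirtualConcat` below is a hypothetical stand-in, not the MappedCollection implementation: it treats several in-memory row stores as one long sequence and returns only the "X" and "_store_idx" keys.

```python
from bisect import bisect_right
from itertools import accumulate


class VirtualConcat:
    """Index several row stores as if they were one long sequence."""

    def __init__(self, stores: list[list[list[float]]]):
        self.stores = stores
        # Cumulative row counts, e.g. store sizes [2, 3] give offsets [2, 5].
        self.offsets = list(accumulate(len(s) for s in stores))

    def __len__(self) -> int:
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, idx: int) -> dict:
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        store_idx = bisect_right(self.offsets, idx)  # which store holds this row
        start = self.offsets[store_idx - 1] if store_idx > 0 else 0
        return {"X": self.stores[store_idx][idx - start], "_store_idx": store_idx}


stores = [
    [[1.0, 2.0], [3.0, 4.0]],                # store 0 with 2 observations
    [[5.0, 6.0], [7.0, 8.0], [9.0, 10.0]],   # store 1 with 3 observations
]
ds = VirtualConcat(stores)
print(len(ds), ds[2])  # 5 {'X': [5.0, 6.0], '_store_idx': 1}
```

No data is copied when the stores are combined; only the integer index is translated into a store index and a row within that store.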

Note

For a guide, see Train a machine learning model on a collection.

This method currently only works for collections of AnnData artifacts.

Parameters:
  • layers_keys (str | list[str] | None, default: None) – Keys from the .layers slot. layers_keys=None or "X" in the list retrieves .X.

  • obs_keys (str | list[str] | None, default: None) – Keys from the .obs slots.

  • obsm_keys (str | list[str] | None, default: None) – Keys from the .obsm slots.

  • obs_filter (tuple[str, str | tuple[str, ...]] | None, default: None) – Select only observations with these values for the given obs column. Should be a tuple with an obs column name as the first element and filtering values (a string or a tuple of strings) as the second element.

  • join (Literal['inner', 'outer'] | None, default: 'inner') – "inner" or "outer" virtual joins. If None is passed, does not join.

  • encode_labels (bool | list[str], default: True) – Encode labels into integers. Can be a list with elements from obs_keys.

  • unknown_label (str | dict[str, str] | None, default: None) – Encode this label to -1. Can be a dictionary with keys from obs_keys if encode_labels=True or from encode_labels if it is a list.

  • cache_categories (bool, default: True) – Enable caching categories of obs_keys for faster access.

  • parallel (bool, default: False) – Enable sampling with multiple processes.

  • dtype (str | None, default: None) – Convert numpy arrays from .X, .layers and .obsm to this dtype.

  • stream (bool, default: False) – Whether to stream data from the array backend.

  • is_run_input (bool | None, default: None) – Whether to track this collection as run input.
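What encode_labels and unknown_label mean for a single obs column can be sketched as follows. The function below is an illustration only: assigning codes in sorted order is an assumption, and the integer codes produced by mapped() may be assigned differently.

```python
from typing import Optional


def encode_labels(labels: list[str], unknown_label: Optional[str] = None) -> dict[str, int]:
    """Map each distinct label to an integer code; the unknown label maps to -1."""
    categories = sorted(set(labels) - {unknown_label})
    codes = {label: code for code, label in enumerate(categories)}
    if unknown_label is not None:
        codes[unknown_label] = -1
    return codes


cell_types = ["T cell", "B cell", "unknown", "T cell"]
codes = encode_labels(cell_types, unknown_label="unknown")
encoded = [codes[label] for label in cell_types]
print(encoded)  # [1, 0, -1, 1]
```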

Return type:

MappedCollection

Examples

>>> import lamindb as ln
>>> from torch.utils.data import DataLoader
>>> collection = ln.Collection.get(description="my collection")
>>> mapped = collection.mapped(obs_keys=["cell_type", "batch"])
>>> dl = DataLoader(mapped, batch_size=128, shuffle=True)

prepare_database_save(field)

refresh_from_db(using=None, fields=None, from_queryset=None)

Reload field values from the database.

By default, the reloading happens from the database this instance was loaded from, or by the read router if this instance wasn’t loaded from any database. The using parameter will override the default.

Fields can be used to specify which fields to reload. The fields should be an iterable of field attnames. If fields is None, then all non-deferred fields are reloaded.

When accessing deferred fields of an instance, the deferred loading of the field will call this method.

restore()

Restore collection record from trash.

Return type:

None

Examples

For any Collection object collection, call:

>>> collection.restore()

save(using=None)

Save the collection and underlying artifacts to database & storage.

Parameters:

using (str | None, default: None) – The database to which you want to save.

Return type:

Collection

Examples

>>> artifact = ln.Artifact("./myfile.csv", description="myfile")
>>> artifact.save()
>>> collection = ln.Collection([artifact], name="myfile")
>>> collection.save()

save_base(raw=False, force_insert=False, force_update=False, using=None, update_fields=None)

Handle the parts of saving which should be done only once per save, yet need to be done in raw saves, too. This includes some sanity checks and signal sending.

The ‘raw’ argument is telling save_base not to save any parent models and not to do any changes to the values before save. This is used by fixture loading.

serializable_value(field_name)

Return the value of the field name for this instance. If the field is a foreign key, return the id value instead of the object. If there’s no Field object with this name on the model, return the model attribute’s value.

Used to serialize a field’s value (in the serializer, or form output, for example). Normally, you would just access the attribute directly and not use this method.

unique_error_message(model_class, unique_check)

validate_constraints(exclude=None)

validate_unique(exclude=None)

Check unique constraints on the model and raise ValidationError if any failed.

view_lineage(with_children=True)

Graph of data flow.

Return type:

None

Notes

For more info, see use cases: Data lineage.

Examples

>>> collection.view_lineage()
>>> artifact.view_lineage()