Please refer to the sparrow vignette (vignette("sparrow")), and its "The GeneSetDb Class" section in particular, for a more detailed description of the semantics of this central data object.

The GeneSetDb class serves the same purpose as the GSEABase::GeneSetCollection() class: it acts as a centralized object that holds collections of gene sets. It exists because there are things that I wanted to know about my gene set collections that weren't easily inferred from the GeneSetCollection class, which is essentially a "list of GeneSets".

Gene sets are internally represented by a data.table in a "tidy" format, where we minimally require non-NA values for the following three character columns:

  • collection

  • name

  • feature_id

The (collection, name) compound key is the primary key of a gene set. There will be as many entries with the same (collection, name) as there are genes/features in that set.
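To make the long-format layout concrete, here is a minimal base R sketch. The collection names, gene set names, and feature IDs are purely illustrative, and a plain data.frame stands in for the internal data.table; this is not the package's own code.

```r
# Hypothetical gene sets in "tidy" long form: one row per (gene set, feature).
gs <- data.frame(
  collection = c("immune", "immune", "immune", "metabolic", "metabolic"),
  name       = c("T cells", "T cells", "B cells", "glycolysis", "glycolysis"),
  feature_id = c("920", "925", "931", "2645", "5213"),
  stringsAsFactors = FALSE
)

# The (collection, name) pair identifies a gene set, so counting rows per
# pair gives the number of features in each set.
set_sizes <- aggregate(feature_id ~ collection + name, data = gs, FUN = length)
names(set_sizes)[3] <- "N"
print(set_sizes)
```

Each gene set appears once in set_sizes, while gs holds as many rows per set as the set has features.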

The GeneSetDb tracks metadata about genesets at the collection level. This means that we assume that all of the feature_id's used within a collection use the same type of feature identifier (such as GSEABase::EntrezIdentifier()), were defined in the same organism, etc.

Please refer to the "GeneSetDb" section of the vignette for more details regarding the construction and querying of a GeneSetDb object.

GeneSetDb(x, featureIdMap = NULL, collectionName = NULL, ...)

Arguments

x

A GeneSetCollection, or a "two deep" list of either GeneSetCollections or lists of character vectors, which are the gene identifiers. The top level of the "two deep" list represents the different collections, and each of its elements is itself a named list that represents the gene sets in the given collection.

featureIdMap

A data.frame with 2 character columns. The first column holds the IDs of the genes (features) used in the gene sets, and the second column holds the IDs that these should be mapped to. Useful for testing probe-level microarray data against gene-level gene set information.
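The idea behind such a map can be sketched in base R as follows. This only illustrates the translation step and is not how the package applies the map internally. The column names gene_id and probe_id and all IDs are assumptions made for the example, since only the column positions are specified here.

```r
# Hypothetical map: the first column holds the IDs used in the gene sets,
# the second column holds the IDs they should be translated to.
id_map <- data.frame(
  gene_id  = c("920", "920", "925"),
  probe_id = c("probe_a", "probe_b", "probe_c"),
  stringsAsFactors = FALSE
)

# A hypothetical gene set defined with gene-level IDs.
gene_set <- data.frame(
  collection = "immune",
  name       = "T cells",
  feature_id = c("920", "925"),
  stringsAsFactors = FALSE
)

# Translate the gene-level IDs to the probe-level IDs; a gene measured by
# several probes yields several rows.
mapped <- merge(gene_set, id_map, by.x = "feature_id", by.y = "gene_id")
print(mapped)
```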

collectionName

If x represents a single collection, i.e. a single GeneSetCollection or a "one deep" list of genesets (named by geneset), then this parameter provides the name for the collection. If x holds multiple collections, this can be a character vector of the same length with their names. In all cases, if a collection name can't be defined from this, then the collections will be named anonymously. If a value is passed here, it will override any names stored in the list x.

...

These aren't used for anything in particular, but are here to catch extra arguments that may get passed down if this function is part of some call chain.

Value

A GeneSetDb object

Details

The functionality in this class supports the rest of this package, but for your own personal usage, you probably want a {BiocSet}.

Slots

table

The "gene set table": a data.table with geneset information, one row per gene set. Columns include collection, name, N, and n. The end user can add more columns to this data.table as desired. The actual feature IDs are computed on the fly by doing a db[J(group, id)] lookup.

db

A data.table to hold all of the original geneset id information that was used to construct this GeneSetDb.

featureIdMap

Maps the ids used in the geneset lists to the ids (rows) of the expression data that the GSEA is run on.

collectionMetadata

A data.table that keeps metadata about each individual geneset collection, e.g. the user might want to keep track of where the geneset definitions come from. It could also hold a function that parses the (collection, name) combination to generate a URL that points the user to more information about that geneset, etc.

GeneSetDb Construction

The GeneSetDb() constructor is flexible enough to create a GeneSetDb object from a variety of formats that are commonly used in the Bioconductor ecosystem, such as:

  • GSEABase::GeneSetCollection(): If you already have a GeneSetCollection on your hands, you can simply pass it to the GeneSetDb() constructor.

  • list of ids: This format is commonly used to define gene sets in the edgeR/limma universe for testing with camera, roast, romer, etc. The names of the list items are the gene set names, and their values are character vectors of gene identifiers. When it's a single list of lists, you must provide a value for collectionName. You can embed multiple collections of gene sets by having a three-deep list-of-lists-of-ids. The top level list defines the different collections, the second level holds the genesets, and the third level holds the feature identifiers for each gene set. See the examples for clarification.

  • a data.frame-like object: To keep track of your own custom gene sets, you have probably realized the importance of maintaining your own sanity, and likely have gene sets organized in a table-like object that has something like the collection, name, and feature_id columns required for a GeneSetDb. Simply rename the appropriate columns to the ones prescribed here, and pass that into the constructor. Any additional columns (symbol, direction, etc.) will be copied into the GeneSetDb.
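The relationship between the nested-list input and the table input can be sketched in base R. This is an illustrative flattening written for this page, not the constructor's implementation; the collections, gene set names, and IDs are hypothetical.

```r
# Hypothetical nested list: top level = collections, second level = gene sets,
# values = character vectors of feature IDs.
gs_lol <- list(
  immune    = list("T cells" = c("920", "925"), "B cells" = "931"),
  metabolic = list(glycolysis = c("2645", "5213"))
)

# Flatten into the three required columns, one row per feature.
rows <- lapply(names(gs_lol), function(coll) {
  sets <- gs_lol[[coll]]
  data.frame(
    collection = coll,
    name       = rep(names(sets), lengths(sets)),
    feature_id = unlist(sets, use.names = FALSE),
    stringsAsFactors = FALSE
  )
})
gs_long <- do.call(rbind, rows)
print(gs_long)
```

A table shaped like gs_long carries the collection, name, and feature_id columns described above for data.frame-like input.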

Interrogating a GeneSetDb

You might wonder what gene sets are defined in a GeneSetDb: see the geneSets() function.

Curious about what features are defined in your GeneSetDb? See the featureIds() function.

Want the details of a particular gene set? Try the geneSet() function. This will return a data.frame of the gene set definition. Calling geneSet() on a SparrowResult() will return the same data.frame along with the differential expression statistics for the individual members of the geneSet across the contrast that was tested in the seas() call that created the SparrowResult().

GeneSetDb manipulation

You can subset a GeneSetDb to include only a subset of the genesets defined in it. To do this, you need to provide an indexing vector that is as long as length(gdb), i.e. the number of gene sets defined in the GeneSetDb. You can construct such a vector by performing your boolean logic over the geneSets(gdb) table.

Look at the Examples section to see how this works, where we subset the example GeneSetDb down to only the T cell related gene sets.

Examples

## exampleGeneSetDF provides gene set definitions in "long form". We show
## how this can easily be turned into a GeneSetDb from this form, or
## converted to other forms (list of features, or list of lists of
## features) to do the same.
gs.df <- exampleGeneSetDF()
gdb.df <- GeneSetDb(gs.df)

## list of ids
gs.df$key <- encode_gskey(gs.df)
gs.list <- split(gs.df$feature_id, gs.df$key)
gdb.list <- GeneSetDb(gs.list, collectionName='custom-sigs')

## A list of lists, where the top level list splits the collections.
## The name of the collection in the GeneSetDb is taken from this top level
## hierarchy
gs.lol <- as.list(gdb.df, nested=TRUE) ## examine this list-of-lists
gdb.lol <- GeneSetDb(gs.lol) ## note that collection is set properly

## GeneSetDb Interrogation
gsets <- geneSets(gdb.df)
nkcells <- geneSet(gdb.df, 'cellularity', 'NK cells')
fids <- featureIds(gdb.df)

# GeneSetDb Manipulation ....................................................
# Subset down to only T cell related gene sets
gdb.t <- gdb.df[grepl("T cell", geneSets(gdb.df)$name)]
gdb.t
#> ===============================================================================
#> GeneSetDb with 2 defined genesets across 1 collections (0 gene sets are active)
#>   Conformed: no
#> -------------------------------------------------------------------------------
#> Key: <collection, name>
#>     collection        name active     N     n
#>         <char>      <char> <lgcl> <int> <int>
#> 1: cellularity T cells CD4  FALSE    15    NA
#> 2: cellularity T cells CD8  FALSE    15    NA
#> -------------------------------------------------------------------------------
#> GeneSetDb with 2 defined genesets across 1 collections (0 gene sets are active)
#>   Conformed: no
#> ===============================================================================