Please refer to the sparrow vignette (vignette("sparrow")), and the "The GeneSetDb Class" section in particular, for a more detailed description of the semantics of this central data object.
The GeneSetDb class serves the same purpose as the GSEABase::GeneSetCollection() class does: it acts as a centralized object to hold collections of gene sets. It exists because there are things that I wanted to know about my gene set collections that weren't easily inferred from what is essentially a "list of GeneSets", which is what the GeneSetCollection class is.
Gene sets are internally represented by a data.table in "tidy" format, where we minimally require non-NA values for the following three character columns:
collection
name
feature_id
The (collection, name) compound key is the primary key of a gene set. There will be as many entries with the same (collection, name) as there are genes/features in that set.
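As a rough illustration of this layout, the sketch below builds a small long-form table in base R. A plain data.frame stands in for the internal data.table, and the gene set names and feature ids are made up for illustration.

```r
# Hypothetical gene sets in long ("tidy") form: one row per feature per set.
gs <- data.frame(
  collection = c("cellularity", "cellularity", "cellularity", "pathways", "pathways"),
  name       = c("NK cells", "NK cells", "T cells", "glycolysis", "glycolysis"),
  feature_id = c("1001", "1002", "1003", "2001", "2002"),
  stringsAsFactors = FALSE
)

# All three required columns must be character and free of NA values.
required <- c("collection", "name", "feature_id")
ok <- all(vapply(gs[required],
                 function(col) is.character(col) && !anyNA(col),
                 logical(1)))

# The (collection, name) pair identifies a gene set; counting rows per pair
# gives the number of features in each set.
set_sizes <- aggregate(feature_id ~ collection + name, data = gs, FUN = length)
names(set_sizes)[3] <- "N"
```

Each (collection, name) pair appears once in set_sizes, while it appears N times in gs, once per feature.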
The GeneSetDb tracks metadata about gene sets at the collection level. This means that we assume that all of the feature_id's used within a collection use the same type of feature identifier (such as a GSEABase::EntrezIdentifier()), were defined in the same organism, etc.
Please refer to the "GeneSetDb" section of the vignette for more details regarding the construction and querying of a GeneSetDb object.
GeneSetDb(x, featureIdMap = NULL, collectionName = NULL, ...)
x
A GeneSetCollection, or a "two deep" list of either GeneSetCollections or lists of character vectors, which are the gene identifiers. The top level of the "two deep" list represents the different collections, and each element is itself a named list, which represents the gene sets in the given collection.
featureIdMap
A data.frame with 2 character columns. The first column holds the ids of the genes (features) used to identify the genes in gene.sets, and the second column holds the IDs that these should be mapped to. Useful for testing probe-level microarray data against gene-level gene set information.
collectionName
If x represents a single collection, i.e. a single GeneSetCollection or a "one deep" list of gene sets (named by gene set), then this parameter provides the name for the collection. If x holds multiple collections, this can be a character vector of the same length with the names. In all cases, if a collection name can't be defined from this, then collections will be named anonymously. If a value is passed here, it will override any names stored in the list of x.
...
These aren't used for anything in particular, but are here to catch extra arguments that may get passed down if this function is part of some call chain.
A GeneSetDb object
The functionality in this class supports the rest of this package, but for your own personal usage, you probably want a {BiocSet}.
table
The "gene set table": a data.table with geneset information,
one row per gene set. Columns include collection, name, N, and n. The end
user can add more columns to this data.table as desired. The actual
feature IDs are computed on the fly by doing a db[J(group, id)].
db
A data.table to hold all of the original gene set id information that was used to construct this GeneSetDb.
featureIdMap
Maps the ids used in the gene set lists to the ids (rows) of the expression data the GSEA is run on.
collectionMetadata
A data.table to keep metadata about each individual gene set collection, i.e. the user might want to keep track of where the gene set definitions come from. This could also hold a function that parses the (collection, name) combination to generate a URL that points the user to more information about that gene set, etc.
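To make the role of an id map more concrete, here is a minimal base R sketch of translating gene-level ids into the row ids of probe-level expression data. The column names gene_id and probe_id and all of the ids are hypothetical; they are not the column names sparrow uses internally.

```r
# Hypothetical two-column id map: gene set ids on the left, expression row
# ids on the right. One gene may be measured by several probes.
id_map <- data.frame(
  gene_id  = c("1001", "1001", "1002"),
  probe_id = c("p_a", "p_b", "p_c"),
  stringsAsFactors = FALSE
)

# A gene set defined with gene-level ids; "9999" has no probe in the map.
gene_set <- c("1001", "1002", "9999")

# Probe ids that represent this gene set in the expression data.
mapped <- id_map$probe_id[id_map$gene_id %in% gene_set]

# Gene ids that cannot be tested because nothing maps to them.
unmapped <- setdiff(gene_set, id_map$gene_id)
```

Features that fail to map are simply absent from the expression data, so they cannot contribute to a test on that dataset.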
The GeneSetDb() constructor is flexible enough to create a GeneSetDb object from a variety of formats that are commonly used in the Bioconductor ecosystem, such as:
GSEABase::GeneSetCollection(): If you already have a GeneSetCollection on your hands, you can simply pass it to the GeneSetDb() constructor.
list of ids: This format is commonly used to define gene sets in the edgeR/limma universe for testing with camera, roast, romer, etc. The names of the list items are the gene set names, and their values are character vectors of gene identifiers. When it's a single list of lists, you must provide a value for collectionName. You can embed multiple collections of gene sets by having a three-deep list-of-lists-of-ids. The top level list defines the different collections, the second level holds the gene sets, and the third level holds the feature identifiers for each gene set. See the examples for clarification.
a data.frame-like object: To keep track of your own custom gene sets, you have probably realized the importance of maintaining your own sanity, and likely have gene sets organized in a table-like object that has something like the collection, name, and feature_id columns required for a GeneSetDb. Simply rename the appropriate columns to the ones prescribed here, and pass that into the constructor. Any additional columns (symbol, direction, etc.) will be copied into the GeneSetDb.
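The relationship between the list and table formats can be sketched in base R alone. The snippet below, which uses made-up gene sets and does not call any sparrow function, reshapes a long-form table into a single-collection list and into a nested list with collections at the top level.

```r
# Hypothetical long-form gene set table (collection, name, feature_id).
gs <- data.frame(
  collection = c("cellularity", "cellularity", "cellularity", "pathways", "pathways"),
  name       = c("NK cells", "NK cells", "T cells", "glycolysis", "glycolysis"),
  feature_id = c("1001", "1002", "1003", "2001", "2002"),
  stringsAsFactors = FALSE
)

# A single collection as a named list of character vectors. This shape
# carries no collection name of its own.
cellularity <- gs[gs$collection == "cellularity", ]
one_collection <- split(cellularity$feature_id, cellularity$name)

# Several collections: the top level splits by collection, the second level
# by gene set name, and the leaves are the feature ids.
nested <- lapply(split(gs, gs$collection),
                 function(d) split(d$feature_id, d$name))
```

Here one_collection is the shape that needs a collectionName when handed to the constructor, while nested already carries its collection names in the top-level list names.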
You might wonder what gene sets are defined in a GeneSetDb: see the geneSets() function.
Curious about what features are defined in your GeneSetDb? See the featureIds() function.
Want the details of a particular gene set? Try the geneSet() function. This will return a data.frame of the gene set definition. Calling geneSet() on a SparrowResult() will return the same data.frame along with the differential expression statistics for the individual members of the gene set across the contrast that was tested in the seas() call that created the SparrowResult().
You can subset a GeneSetDb to keep only some of the gene sets defined in it. To do this, you need to provide an indexing vector that is as long as length(gdb), i.e. the number of gene sets defined in the GeneSetDb. You can construct such a vector by performing your boolean logic over the geneSets(gdb) table.
Look at the Examples section to see how this works, where we take the MSigDB c7 collection (aka "ImmuneSigDB") and only keep gene sets that were defined in experiments from mouse.
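The indexing vector itself is ordinary R logic over a one-row-per-gene-set table. In the minimal sketch below, a hypothetical data.frame stands in for the table returned by geneSets(gdb), so no GeneSetDb object is needed to run it.

```r
# Hypothetical stand-in for geneSets(gdb): one row per gene set.
gsets <- data.frame(
  collection = rep("cellularity", 4),
  name       = c("B cells", "NK cells", "T cells CD4", "T cells CD8"),
  N          = c(12L, 10L, 15L, 15L),
  stringsAsFactors = FALSE
)

# One logical value per gene set, built from the columns of the table.
keep <- grepl("T cell", gsets$name) & gsets$N >= 10

# The index must have exactly one entry per gene set.
stopifnot(length(keep) == nrow(gsets))

kept_names <- gsets$name[keep]
```

With a real object, a vector built this way from geneSets(gdb) is what goes inside the square brackets when subsetting gdb.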
## exampleGeneSetDF provides gene set definitions in "long form". We show
## how this can easily be turned into a GeneSetDb from this form, or converted
## to other forms (list of features, or list of list of features) to
## do the same.
gs.df <- exampleGeneSetDF()
gdb.df <- GeneSetDb(gs.df)
## list of ids
gs.df$key <- encode_gskey(gs.df)
gs.list <- split(gs.df$feature_id, gs.df$key)
gdb.list <- GeneSetDb(gs.list, collectionName='custom-sigs')
## A list of lists, where the top level list splits the collections.
## The name of the collection in the GeneSetDb is taken from this top level
## hierarchy
gs.lol <- as.list(gdb.df, nested=TRUE) ## examine this list-of-lists
gdb.lol <- GeneSetDb(gs.lol) ## note that collection is set properly
## GeneSetDb Interrogation
gsets <- geneSets(gdb.df)
nkcells <- geneSet(gdb.df, 'cellularity', 'NK cells')
fids <- featureIds(gdb.df)
# GeneSetDb Manipulation ....................................................
# Subset down to only t cell related gene sets
gdb.t <- gdb.df[grepl("T cell", geneSets(gdb.df)$name)]
gdb.t
#> ===============================================================================
#> GeneSetDb with 2 defined genesets across 1 collections (0 gene sets are active)
#> Conformed: no
#> -------------------------------------------------------------------------------
#> Key: <collection, name>
#>     collection        name active     N     n
#>         <char>      <char> <lgcl> <int> <int>
#> 1: cellularity T cells CD4  FALSE    15    NA
#> 2: cellularity T cells CD8  FALSE    15    NA
#> -------------------------------------------------------------------------------
#> GeneSetDb with 2 defined genesets across 1 collections (0 gene sets are active)
#> Conformed: no
#> ===============================================================================