Title: | Complementary Genomic Functions |
---|---|
Description: | Presents a series of molecular and genetic routines in the R environment with the aim of assisting in analytical pipelines before and after the use of 'asreml' or another library to perform analyses such as Genomic Selection or Genome-Wide Association Analyses. Methods and examples are described in Gezan, Oliveira, Galli, and Murray (2022) <https://asreml.kb.vsni.co.uk/wp-content/uploads/sites/3/ASRgenomics_Manual.pdf>. |
Authors: | Salvador Gezan [aut, cre], Darren Murray [aut], Amanda Avelar de Oliveira [aut], Giovanni Galli [aut], VSN International [cph] |
Maintainer: | Salvador Gezan <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.1.4 |
Built: | 2025-02-24 03:17:58 UTC |
Source: | https://github.com/cran/ASRgenomics |
Modifies the input square matrix into its sparse form with three columns per line,
corresponding to the set: Row, Col, Value
. This matrix defines the lower triangle
row-wise of the original matrix and it is sorted as columns within row.
Values of zero on the matrix are dropped by default. Individual
names should be assigned to rownames
and colnames
.
full2sparse(K = NULL, drop.zero = TRUE)
full2sparse(K = NULL, drop.zero = TRUE)
K |
A square matrix in full form ( |
drop.zero |
If |
Based on the function published by Borgognone et al. (2016).
A matrix in sparse form with columns: Row, Col, Value
together with the attributes rowNames
and colNames
.
If attribute INVERSE
is found this is also passed to the sparse matrix.
Borgognone, M.G., Butler, D.G., Ogbonnaya, F.C and Dreccer, M.F. 2016. Molecular marker information in the analysis of multi-environment trial helps differentiate superior genotypes from promising parents. Crop Science 56:2612-2628.
# Get G matrix. G <- G.matrix(M = geno.apple, method = "VanRaden", sparseform = FALSE)$G G[1:5, 1:5] # Transform matrix into sparse. G.sparse <- full2sparse(K = G) head(G.sparse) head(attr(G.sparse, "rowNames")) head(attr(G.sparse, "colNames"))
# Get G matrix. G <- G.matrix(M = geno.apple, method = "VanRaden", sparseform = FALSE)$G G[1:5, 1:5] # Transform matrix into sparse. G.sparse <- full2sparse(K = G) head(G.sparse) head(attr(G.sparse, "rowNames")) head(attr(G.sparse, "colNames"))
Generates the inverse of a genomic relationship matrix that is provided.
This input matrix should be of the full form (
)
with individual names assigned to
rownames
and colnames
. Several checks for the stability of the matrix are
presented based on the reciprocal conditional number.
In case of an ill-conditioned matrix,
options of blending, bending or aligning before inverting are available.
These options will be deprecated (discontinued)
in future versions of ASRgenomics
as they can be better implemented in the function G.tuneup()
.
Based on procedures published by Nazarian and Gezan et al. (2016).
G.inverse( G = NULL, A = NULL, rcn.thr = 1e-12, blend = FALSE, pblend = 0.02, bend = FALSE, eig.tol = NULL, align = FALSE, digits = 8, sparseform = FALSE, message = TRUE )
G.inverse( G = NULL, A = NULL, rcn.thr = 1e-12, blend = FALSE, pblend = 0.02, bend = FALSE, eig.tol = NULL, align = FALSE, digits = 8, sparseform = FALSE, message = TRUE )
G |
Input of the symmetric genomic relationship matrix |
A |
Input of the pedigree relationship matrix |
rcn.thr |
A threshold for identifying the |
blend |
If |
pblend |
If blending is requested this is the proportion of the identity matrix |
bend |
If |
eig.tol |
Defines relative positiveness (i.e., non-zero) of eigenvalues compared to the largest one.
It determines which threshold of eigenvalues will be treated as zero (default = |
align |
If |
digits |
Set up the number of digits in used to round the output matrix (default = |
sparseform |
If |
message |
If |
A list with three of the following elements:
Ginv
: the inverse of matrix in full form (only if
sparseform = FALSE
).
Ginv.sparse
: the inverse of matrix in sparse form (only if
sparseform = TRUE
).
status
: the status (ill-conditioned
or well-conditioned
) of
the inverse of matrix.
rcn
: the reciprocal conditional number of the inverse of matrix.
Nazarian A., Gezan S.A. 2016. GenoMatrix: A software package for pedigree-based and genomic prediction analyses on complex traits. Journal of Heredity 107:372-379.
# Example: An ill-conditioned matrix. # Get G matrix. G <- G.matrix(M = geno.apple, method = "VanRaden")$G G[1:5, 1:5] # Get the inverse of G. GINV <- G.inverse(G = G, bend = FALSE, blend = FALSE, align = FALSE) GINV$Ginv[1:5, 1:5] GINV$status
# Example: An ill-conditioned matrix. # Get G matrix. G <- G.matrix(M = geno.apple, method = "VanRaden")$G G[1:5, 1:5] # Get the inverse of G. GINV <- G.inverse(G = G, bend = FALSE, blend = FALSE, align = FALSE) GINV$Ginv[1:5, 1:5] GINV$status
Generates the genomic numerator relationship matrix for
additive (VanRaden or Yang) or dominant (Su or Vitezica) relationships.
Matrix provided is of the form
, with
individuals and
markers.
Individual and
marker names are assigned to
rownames
and colnames
, respectively.
SNP data is coded as 0, 1, 2 (integers or decimal numbers).
Missing values, if present, need to be specified.
G.matrix( M = NULL, method = "VanRaden", na.string = "NA", sparseform = FALSE, digits = 8, message = TRUE )
G.matrix( M = NULL, method = "VanRaden", na.string = "NA", sparseform = FALSE, digits = 8, message = TRUE )
M |
A matrix with SNP data of form |
method |
The method considered for calculation of genomic matrix.
Options are: |
na.string |
A character that is interpreted as missing values (default = |
sparseform |
If |
digits |
Set up the number of digits used to round the output matrix (default = 8). |
message |
If |
Note: If data is provided with missing values, it will process calculations of relationships on pairwise non-missing data.
It uses function Gmatrix
for calculations
from R package AGHmatrix (Amadeu et al. 2019).
A list with one of these two elements:
G
: the matrix in full form (only if
sparseform = FALSE
).
G.sparse
: the matrix in sparse form (only if
sparseform = TRUE
).
Amadeu, R.R., Cellon, C., Olmstead, J.W., Garcia, A.A.F, Resende, M.F.R. and P.R. Munoz. 2016. AGHmatrix: R package to construct relationship matrices for autotetraploid and diploid species: A blueberry example. The Plant Genome 9(3). doi: 10.3835/plantgenome2016.01.0009
# Example: Requesting a full matrix by VanRanden. # Get G matrix. G <- G.matrix(M = geno.apple, method = "VanRaden")$G G[1:5, 1:5] # Example: Requesting a sparse form by VanRanden. # Get G matrix. G <- G.matrix(M = geno.apple, method = "VanRaden", sparseform = TRUE)$G.sparse head(G) head(attr(G, "rowNames"))
# Example: Requesting a full matrix by VanRanden. # Get G matrix. G <- G.matrix(M = geno.apple, method = "VanRaden")$G G[1:5, 1:5] # Example: Requesting a sparse form by VanRanden. # Get G matrix. G <- G.matrix(M = geno.apple, method = "VanRaden", sparseform = TRUE)$G.sparse head(G) head(attr(G, "rowNames"))
Predicts random effects values for individuals with unobserved responses (here called x
,
a vector of length ) based on known random effect values for individuals with
observed responses (here called
y
, a vector of length ). This is done using the
common genomic relationship matrix
for all
individuals (full matrix of dimension
).
The prediction of unobserved responses will be performed through the
multivariante Normal conditional distribution. These predictions are identical to
what would be obtained if the entire set of individuals () were included into a
GBLUP animal model fit with individuals in the set
x
coded as missing.
The user needs to provide the matrix in full form.
Individual names (
) should be assigned to
rownames
and colnames
, and these
can be in any order. If the variance-covariance matrix of the set y
is provided,
standard errors of random effects in set x
are calculated.
G.predict(G = NULL, gy = NULL, vcov.gy = NULL)
G.predict(G = NULL, gy = NULL, vcov.gy = NULL)
G |
Input of the genomic relationship matrix |
gy |
Input of random effects (e.g. breeding values) for individuals with known values.
Individual names should be assigned to |
vcov.gy |
The variance-covariance matrix associated with the random effects from the
individuals with known values (set |
A data frame with the predicted random effect values for individuals with
unobserved responses in the set x
. If the variance-covariance matrix is provided,
standard errors are included in an additional column.
## Not run: library(asreml) # Load asreml. # Example: Apple data creating 100 missing observations. # Prepare G (nx + ny). G <- G.matrix(M = geno.apple, method = "VanRaden", sparseform = FALSE)$G dim(G) # Prepare subset of data. # Select only 147 from 247 individuals from pheno.apple and geno.apple. Gy <- G[1:147, 1:147] phenoy <- pheno.apple[1:147, ] # Obtain the BLUPs for the 147 individuals using ASReml-R. # Blend Gy. Gyb <- G.tuneup(G = Gy, blend = TRUE, pblend = 0.02)$Gb # Get the Gy inverse Gyinv <- G.inverse(G = Gyb, sparseform = TRUE)$Ginv.sparse # Fit a GBLUP model phenoy$INDIV <- as.factor(phenoy$INDIV) modelGBLUP <- asreml( fixed = JUI_MOT ~ 1, random = ~vm(INDIV, Gyinv), workspace = 128e06, data = phenoy) # Obtain Predictions - BLUP (set y). BLUP <- summary(modelGBLUP,coef = TRUE)$coef.random head(BLUP) gy <- as.matrix(BLUP[, 1]) rownames(gy) <- phenoy$INDIV # Ready to make conditional predictions. g.cond <- G.predict(G = G, gy = gy) head(g.cond) ## End(Not run)
## Not run: library(asreml) # Load asreml. # Example: Apple data creating 100 missing observations. # Prepare G (nx + ny). G <- G.matrix(M = geno.apple, method = "VanRaden", sparseform = FALSE)$G dim(G) # Prepare subset of data. # Select only 147 from 247 individuals from pheno.apple and geno.apple. Gy <- G[1:147, 1:147] phenoy <- pheno.apple[1:147, ] # Obtain the BLUPs for the 147 individuals using ASReml-R. # Blend Gy. Gyb <- G.tuneup(G = Gy, blend = TRUE, pblend = 0.02)$Gb # Get the Gy inverse Gyinv <- G.inverse(G = Gyb, sparseform = TRUE)$Ginv.sparse # Fit a GBLUP model phenoy$INDIV <- as.factor(phenoy$INDIV) modelGBLUP <- asreml( fixed = JUI_MOT ~ 1, random = ~vm(INDIV, Gyinv), workspace = 128e06, data = phenoy) # Obtain Predictions - BLUP (set y). BLUP <- summary(modelGBLUP,coef = TRUE)$coef.random head(BLUP) gy <- as.matrix(BLUP[, 1]) rownames(gy) <- phenoy$INDIV # Ready to make conditional predictions. g.cond <- G.predict(G = G, gy = gy) head(g.cond) ## End(Not run)
Generates a new matrix that can be blended, bended or aligned
in order to make it stable for future use or matrix inversion. The input matrix should
be of the full form () with individual names assigned to
rownames
and colnames
.
This routine provides three options of tune-up:
Blend. The matrix is blended (or averaged)
with another matrix.
, where
is the blended matrix,
is the original matrix,
is the proportion of
the identity matrix or pedigree-based relationship matrix to consider, and
is the matrix to blend. Ideally, the pedigree-based
relationship matrix should be used, but if this is not available (or it is
of poor quality), then it is replaced by an identity matrix
.
Bend. It consists on adjusting the original
matrix to obtain a near positive definite matrix, which
is done by making all negative or very small eigenvalues slightly positive.
Align. The original matrix is aligned to
the matching pedigree relationship matrix
where an
and
parameters are obtained. More information can
be found in the manual or in Christensen et al. (2012).
The user should provide the matrices and
in full form (
)
and matching individual names should be
assigned to the
rownames
and colnames
of the matrices.
Based on procedures published by Nazarian and Gezan et al. (2016).
G.tuneup( G = NULL, A = NULL, blend = FALSE, pblend = 0.02, bend = FALSE, eig.tol = 1e-06, align = FALSE, rcn = TRUE, digits = 8, sparseform = FALSE, determinant = TRUE, message = TRUE )
G.tuneup( G = NULL, A = NULL, blend = FALSE, pblend = 0.02, bend = FALSE, eig.tol = 1e-06, align = FALSE, rcn = TRUE, digits = 8, sparseform = FALSE, determinant = TRUE, message = TRUE )
G |
Input of the genomic matrix |
A |
Input of the matching pedigree relationship matrix |
blend |
If |
pblend |
If blending is requested this is the proportion of the identity matrix |
bend |
If |
eig.tol |
Defines relative positiveness (i.e., non-zero) of eigenvalues compared to the largest one.
It determines which threshold of eigenvalues will be treated as zero (default = |
align |
If |
rcn |
If |
digits |
Set up the number of digits used to round the output matrix (default = |
sparseform |
If |
determinant |
If |
message |
If |
A list with six of the following elements:
Gb
: the inverse of matrix in full form
(only if sparseform =
FALSE
).
Gb.sparse
: if requested, the inverse of matrix in
sparse form (only if sparseform =
TRUE
).
rcn0
: the reciprocal conditional number of the original matrix.
Values near zero are associated with an ill-conditioned matrix.
rcnb
: the reciprocal conditional number of the blended, bended or aligned
matrix. Values near zero are associated with an ill-conditioned matrix.
det0
: if requested, the determinant of the original matrix.
blend
: if the matrix was blended.
bend
: if the matrix was bended.
align
: if the matrix was aligned.
Christensen, O.F., Madsen, P., Nielsen, B., Ostersen, T. and Su, G. 2012. Single-step methods for genomic evaluation in pigs. Animal 6:1565-1571. doi:10.1017/S1751731112000742.
Nazarian A., Gezan S.A. 2016. GenoMatrix: A software package for pedigree-based and genomic prediction analyses on complex traits. Journal of Heredity 107:372-379.
# Example: Apple dataset. # Get G matrix. G <- G.matrix(M = geno.apple, method = "VanRaden")$G G[1:5, 1:5] # Blend G matrix. G_blended <- G.tuneup(G = G, blend = TRUE, pblend = 0.05) G_blended$Gb[1:5, 1:5] # Bend G matrix. G_bended <- G.tuneup(G = G, bend = TRUE, eig.tol = 1e-03) G_bended$Gb[1:5, 1:5] # Example: Loblolly Pine dataset with pedigree - Aligned G matrix. A <- AGHmatrix::Amatrix(ped.pine) dim(A) # Read and filter genotypic data. M.clean <- qc.filtering( M = geno.pine655, maf = 0.05, marker.callrate = 0.2, ind.callrate = 0.20, na.string = "-9")$M.clean # Get G matrix. G <- G.matrix(M = M.clean, method = "VanRaden", na.string = "-9")$G G[1:5, 1:5] dim(G) # Match G and A. Aclean <- match.G2A(A = A, G = G, clean = TRUE, ord = TRUE, mism = TRUE)$Aclean # Align G with A. G_align <- G.tuneup(G = G, A = Aclean, align = TRUE) G_align$Gb[1:5, 1:5]
# Example: Apple dataset. # Get G matrix. G <- G.matrix(M = geno.apple, method = "VanRaden")$G G[1:5, 1:5] # Blend G matrix. G_blended <- G.tuneup(G = G, blend = TRUE, pblend = 0.05) G_blended$Gb[1:5, 1:5] # Bend G matrix. G_bended <- G.tuneup(G = G, bend = TRUE, eig.tol = 1e-03) G_bended$Gb[1:5, 1:5] # Example: Loblolly Pine dataset with pedigree - Aligned G matrix. A <- AGHmatrix::Amatrix(ped.pine) dim(A) # Read and filter genotypic data. M.clean <- qc.filtering( M = geno.pine655, maf = 0.05, marker.callrate = 0.2, ind.callrate = 0.20, na.string = "-9")$M.clean # Get G matrix. G <- G.matrix(M = M.clean, method = "VanRaden", na.string = "-9")$G G[1:5, 1:5] dim(G) # Match G and A. Aclean <- match.G2A(A = A, G = G, clean = TRUE, ord = TRUE, mism = TRUE)$Aclean # Align G with A. G_align <- G.tuneup(G = G, A = Aclean, align = TRUE) G_align$Gb[1:5, 1:5]
Genotypic data on 247 apple clones (i.e., genotypes) with a total of 2,828 SNP markers (coded as 0, 1, 2 and there are no missing records). Dataset obtained from supplementary material in Kumar et al. (2015).
geno.apple
geno.apple
matrix
Kumar S., Molloy C., Muñoz P., Daetwyler H., Chagné D., and Volz R. 2015. Genome-enabled estimates of additive and nonadditive genetic variances and prediction of apple phenotypes across environments. G3 Genes, Genomes, Genetics 5:2711-2718.
geno.apple[1:5, 1:5]
geno.apple[1:5, 1:5]
Genotypic data for a total of 4,853 SNPs (coded as 0, 1, 2 and -9 for missing) on 655 genotypes of Loblolly Pine (Pinus taeda L.). Dataset modified from supplementary material from Resende et al. (2012). This dataset differs from the original as some genotypes were made artificially missing by full-sib family.
geno.pine655
geno.pine655
matrix
Resende, M.F.R., Munoz, P. Resende, M.D.V., Garrick, D.J., Fernando, R.L., Davis, J.M., Jokela, E.J., Martin, T.A., Peter, G.F., and Kirst, M. 2012. Accuracy of genomic selection methods in a standard data set of loblolly pine (Pinus taeda L.). Genetics 190:1503-1510.
geno.pine655[1:5, 1:5]
geno.pine655[1:5, 1:5]
Genotypic data for a total of 4,853 SNPs (coded as 0, 1, 2 and -9 for missing) on 926 genotypes of Loblolly Pine (Pinus taeda L.). Dataset obtained from supplementary material in Resende et al. (2012).
geno.pine926
geno.pine926
matrix
Resende, M.F.R., Munoz, P. Resende, M.D.V., Garrick, D.J., Fernando, R.L., Davis, J.M., Jokela, E.J., Martin, T.A., Peter, G.F., and Kirst, M. 2012. Accuracy of genomic selection methods in a standard data set of loblolly pine (Pinus taeda L.). Genetics 190:1503-1510.
geno.pine926[1:5, 1:5]
geno.pine926[1:5, 1:5]
Genotypic data on 1,481 Atlantic salmon samples. A total of 17,156
SNP markers (coded as 0, 1, 2 and NA
for missing) are included in this dataset.
Dataset obtained from supplementary material in Robledo et al. (2018).
geno.salmon
geno.salmon
matrix
Robledo D., Matika O., Hamilton A., and Houston R.D. 2018. Genome-wide association and genomic selection for resistance to amoebic gill disease in Atlantic salmon. G3 Genes, Genomes, Genetics 8:1195-1203.
geno.salmon[1:5, 1:5]
geno.salmon[1:5, 1:5]
The single-step GBLUP approach combines the information from the pedigree relationship matrix
and the genomic relationship matrix
in one
hybrid relationship matrix called
.
This function will calculate directly the inverse of this matrix
.
The user should provide the matrices
or
its inverse (only one of these is required) and the
inverse of the matrix
(
) in its full form. Individual names should
be assigned to
rownames
and colnames
, and individuals from
are verified to be all a subset within individuals from
(or
).
H.inverse( A = NULL, Ainv = NULL, Ginv = NULL, lambda = NULL, tau = 1, omega = 1, sparseform = FALSE, keep.order = TRUE, digits = 8, inverse = TRUE, message = TRUE )
H.inverse( A = NULL, Ainv = NULL, Ginv = NULL, lambda = NULL, tau = 1, omega = 1, sparseform = FALSE, keep.order = TRUE, digits = 8, inverse = TRUE, message = TRUE )
A |
Input of the pedigree relationship matrix |
Ainv |
Input of the inverse of the pedigree relationship matrix
|
Ginv |
Input of the inverse of the genomic relationship matrix
|
lambda |
The scaling factor for |
tau |
The scaling factor for |
omega |
The scaling factor for |
sparseform |
If |
keep.order |
If |
digits |
Set up the number of digits used to round the output matrix (default = |
inverse |
If |
message |
If |
The generation of the matrix contains a few scaling factors
to help with the calculation of this inverse and
to allow further exploration of the combination of the
information from the
and
.
We follow the specifications described by Martini et. al (2018),
which is done by specifying the parameters
, or the pair
and
.
The general expression used is:
and a more common representation of the above expression is found when , as shown below:
If inverse = FALSE
the
matrix is provided instead of its inverse. This option will be deprecated and
it is better to use the function H.matrix.
The matrix is obtained with the following equations:
The inverse of the hybrid matrix matrix, in full or sparse form with
required attributes to be used in asreml.
Christensen, O.F., Lund, M.S. 2010. Genomic prediction matrix when some animals are not genotyped. Gen. Sel. Evol. 42(2):1–8.
Christensen, O., Madsen, P., Nielsen, B., Ostersen, T., and Su, G. 2012. Single-step methods for genomic evaluation in pigs. Animal 6(10):1565–1571.
Legarra, A., Aguilar, I., and Misztal, I. 2009. A relationship matrix including full pedigree and genomic information. J. Dairy Sci. 92:4656-4663.
Martini, J.W.R., Schrauf, M.F., Garcia-Baccino, C.A., Pimentel, E.C.G., Munilla, S.,
Rogberg-Muñoz, A., Cantet, R.J.C., Reimer, C., Gao, N., Wimmer, V., and Simianer, H. 2018.
The effect of the scaling factors
and
on the structure of
in the single-step procedure.
Genet. Sel. Evol. 50:1-9.
# Get A matrix. A <- AGHmatrix::Amatrix(data = ped.pine) A[1:5,1:5] dim(A) # Read and filter genotypic data. M.clean <- qc.filtering( M = geno.pine655, maf = 0.05, marker.callrate = 0.2, ind.callrate = 0.20, na.string = "-9", plots = FALSE)$M.clean # Get G matrix. G <- G.matrix(M.clean, method = "VanRaden", na.string = "-9")$G G[1:5, 1:5] dim(G) # Match G and A. check <- match.G2A( A = A, G = G, clean = TRUE, ord = TRUE, mism = TRUE, RMdiff = TRUE) # Align G matrix with A. G_align <- G.tuneup(G = check$Gclean, A = check$Aclean, align = TRUE, sparseform = FALSE)$Gb # Get Ginverse using the G aligned. Ginv <- G.inverse(G = G_align, sparseform = FALSE)$Ginv Ginv[1:5, 1:5] dim(Ginv) # Obtain Hinv. Hinv <- H.inverse(A = A, Ginv = Ginv, lambda = 0.90, sparseform = TRUE) head(Hinv) attr(Hinv, "INVERSE")
# Get A matrix. A <- AGHmatrix::Amatrix(data = ped.pine) A[1:5,1:5] dim(A) # Read and filter genotypic data. M.clean <- qc.filtering( M = geno.pine655, maf = 0.05, marker.callrate = 0.2, ind.callrate = 0.20, na.string = "-9", plots = FALSE)$M.clean # Get G matrix. G <- G.matrix(M.clean, method = "VanRaden", na.string = "-9")$G G[1:5, 1:5] dim(G) # Match G and A. check <- match.G2A( A = A, G = G, clean = TRUE, ord = TRUE, mism = TRUE, RMdiff = TRUE) # Align G matrix with A. G_align <- G.tuneup(G = check$Gclean, A = check$Aclean, align = TRUE, sparseform = FALSE)$Gb # Get Ginverse using the G aligned. Ginv <- G.inverse(G = G_align, sparseform = FALSE)$Ginv Ginv[1:5, 1:5] dim(Ginv) # Obtain Hinv. Hinv <- H.inverse(A = A, Ginv = Ginv, lambda = 0.90, sparseform = TRUE) head(Hinv) attr(Hinv, "INVERSE")
matrixThe single-step GBLUP approach combines the information from the pedigree relationship matrix
and the genomic relationship matrix
in one
hybrid relationship matrix called
.
This function will calculate directly this matrix
.
The user should provide the matrices
or
its inverse (only one of these is required) and the
inverse of the matrix
(
) in its full form.
Individual names should
be assigned to
rownames
and colnames
, and individuals from
are verified to be all a subset within individuals from
(or
).
This function is a wrapper of the H.inverse function.
H.matrix( A = NULL, Ainv = NULL, Ginv = NULL, lambda = NULL, tau = 1, omega = 1, sparseform = FALSE, keep.order = TRUE, digits = 8, message = TRUE )
H.matrix( A = NULL, Ainv = NULL, Ginv = NULL, lambda = NULL, tau = 1, omega = 1, sparseform = FALSE, keep.order = TRUE, digits = 8, message = TRUE )
A |
Input of the pedigree relationship matrix |
Ainv |
Input of the inverse of the pedigree relationship matrix
|
Ginv |
Input of the inverse of the genomic relationship matrix
|
lambda |
The scaling factor for |
tau |
The scaling factor for |
omega |
The scaling factor for |
sparseform |
If |
keep.order |
If |
digits |
Set up the number of digits used to round the output matrix (default = |
message |
If |
This function is currently equivalent to using H.inverse with (inverse = FALSE
).
The matrix is obtained with the following equations:
The hybrid matrix matrix, in full or sparse form.
Christensen, O.F., Lund, M.S. 2010. Genomic prediction matrix when some animals are not genotyped. Gen. Sel. Evol. 42(2):1–8.
Christensen, O., Madsen, P., Nielsen, B., Ostersen, T., and Su, G. 2012. Single-step methods for genomic evaluation in pigs. Animal 6(10):1565–1571.
Legarra, A., Aguilar, I., and Misztal, I. 2009. A relationship matrix including full pedigree and genomic information. J. Dairy Sci. 92:4656-4663.
Martini, J.W.R., Schrauf, M.F., Garcia-Baccino, C.A., Pimentel, E.C.G., Munilla, S.,
Rogberg-Muñoz, A., Cantet, R.J.C., Reimer, C., Gao, N., Wimmer, V., and Simianer, H. 2018.
The effect of the scaling factors
and
on the structure of
in the single-step procedure.
Genet. Sel. Evol. 50:1-9.
# Get A matrix. A <- AGHmatrix::Amatrix(data = ped.pine) A[1:5,1:5] dim(A) # Read and filter genotypic data. M.clean <- qc.filtering( M = geno.pine655, maf = 0.05, marker.callrate = 0.2, ind.callrate = 0.20, na.string = "-9", plots = FALSE)$M.clean # Get G matrix. G <- G.matrix(M = M.clean, method = "VanRaden", na.string = "-9")$G G[1:5, 1:5] dim(G) # Match G2A. check <- match.G2A( A = A, G = G, clean = TRUE, ord = TRUE, mism = TRUE, RMdiff = TRUE) # Align G matrix with A. G_align <- G.tuneup(G = check$Gclean, A = check$Aclean, align = TRUE, sparseform = FALSE)$Gb # Get Ginverse using the aligned G. Ginv <- G.inverse(G = G_align, sparseform = FALSE)$Ginv Ginv[1:5, 1:5] dim(Ginv) # Obtaining H. H <- H.matrix(A = A, G = Ginv, lambda = 0.90, sparseform = FALSE) H[1:5, 1:5]
# Get A matrix. A <- AGHmatrix::Amatrix(data = ped.pine) A[1:5,1:5] dim(A) # Read and filter genotypic data. M.clean <- qc.filtering( M = geno.pine655, maf = 0.05, marker.callrate = 0.2, ind.callrate = 0.20, na.string = "-9", plots = FALSE)$M.clean # Get G matrix. G <- G.matrix(M = M.clean, method = "VanRaden", na.string = "-9")$G G[1:5, 1:5] dim(G) # Match G2A. check <- match.G2A( A = A, G = G, clean = TRUE, ord = TRUE, mism = TRUE, RMdiff = TRUE) # Align G matrix with A. G_align <- G.tuneup(G = check$Gclean, A = check$Aclean, align = TRUE, sparseform = FALSE)$Gb # Get Ginverse using the aligned G. Ginv <- G.inverse(G = G_align, sparseform = FALSE)$Ginv Ginv[1:5, 1:5] dim(Ginv) # Obtaining H. H <- H.matrix(A = A, G = Ginv, lambda = 0.90, sparseform = FALSE) H[1:5, 1:5]
It reports summary statistics, plots and allows for some filter options
for diagonal and off-diagonal elements for a given kinship matrix.
The input matrix can be a pedigree-based
relationship matrix , a genomic relationship matrix
or a
hybrid relationship matrix
.
Individual names should be assigned to
rownames
and colnames
.
kinship.diagnostics( K = NULL, diagonal.thr.large = 1.2, diagonal.thr.small = 0.8, duplicate.thr = 0.95, clean.diagonal = FALSE, clean.duplicate = FALSE, plots = TRUE, sample.plot = 1, message = TRUE )
kinship.diagnostics( K = NULL, diagonal.thr.large = 1.2, diagonal.thr.small = 0.8, duplicate.thr = 0.95, clean.diagonal = FALSE, clean.duplicate = FALSE, plots = TRUE, sample.plot = 1, message = TRUE )
K |
Input of a kinship matrix in full format ( |
diagonal.thr.large |
A threshold value to flag large diagonal values (default = |
diagonal.thr.small |
A threshold value to flag small diagonal values (default = |
duplicate.thr |
A threshold value to flag possible duplicates. Any calculation larger than the
threshold based on
|
clean.diagonal |
If |
clean.duplicate |
If |
plots |
If |
sample.plot |
A numeric value between 0 and 1 indicating the proportion
of the data points to be sampled for fast plotting of off-diagonal values.
Note that for proportions other than 1, the method is not exact and low
proportions are only recommended for large kinship matrices (default = |
message |
If |
A list with the following elements:
list.diagonal
: a data frame with the list of flagged large or small diagonal values.
list.duplicate
: a data frame with the list of possible duplicates.
clean.kinship
: output of kinship matrix filtered without the flagged diagonal
and/or duplicate individuals.
plot.diagonal
: histogram with the distribution of diagonal values from the kinship matrix.
plot.offdiag
: histogram with the distribution of off-diagonal values from kinship matrix.
# Get G matrix. G <- G.matrix(M = geno.apple, method = "VanRaden")$G # Diagnose G. G_summary <- kinship.diagnostics( K = G, diagonal.thr.large = 1.3, diagonal.thr.small = 0.7, clean.diagonal = TRUE, duplicate.thr = 0.8, clean.duplicate = TRUE, sample.plot = 0.50) ls(G_summary) dim(G_summary$clean.kinship) G_summary$clean.kinship[1:5, 1:5] G_summary$list.duplicate G_summary$list.diagonal G_summary$plot.diag G_summary$plot.offdiag
# Get G matrix. G <- G.matrix(M = geno.apple, method = "VanRaden")$G # Diagnose G. G_summary <- kinship.diagnostics( K = G, diagonal.thr.large = 1.3, diagonal.thr.small = 0.7, clean.diagonal = TRUE, duplicate.thr = 0.8, clean.duplicate = TRUE, sample.plot = 0.50) ls(G_summary) dim(G_summary$clean.kinship) G_summary$clean.kinship[1:5, 1:5] G_summary$list.duplicate G_summary$list.diagonal G_summary$plot.diag G_summary$plot.offdiag
Generates a heatmap with dendrogram based on a provided kinship matrix.
This matrix can be a pedigree relationship matrix , a
genomic relationship matrix
or a hybrid relationship
matrix
.
Individual names should be assigned to
rownames
and colnames
.
It sorts individuals according to dendrogram in both columns and rows.
kinship.heatmap( K = NULL, dendrogram = TRUE, clustering.method = c("hierarchical", "kmeans"), dist.method = c("euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski"), row.label = TRUE, col.label = FALSE )
kinship.heatmap( K = NULL, dendrogram = TRUE, clustering.method = c("hierarchical", "kmeans"), dist.method = c("euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski"), row.label = TRUE, col.label = FALSE )
K |
Input of a kinship matrix in full format ( |
dendrogram |
If |
clustering.method |
The clustering method considered for the dendrogram.
Options are: |
dist.method |
The method considered to calculate the distance matrix between
individuals used for hierarchical clustering. Options are: |
row.label |
If |
col.label |
If |
Uses the library superheat
from Barter and Yu (2018) to generate plots.
A plot with the properties specified by the above arguments.
Barter, R.L. and Yu, B. 2018. Superheat: An R package for creating beautiful and extendable heatmaps for visualizing complex data. J. Comput. Graph. Stat. 27(4):910-922.
# Get G matrix. G <- G.matrix(M = geno.apple, method = "VanRaden")$G G[1:5, 1:5] # Plot a subset of the individuals. kinship.heatmap(K = G[1:10, 1:10], dendrogram = TRUE, row.label = TRUE, col.label = TRUE)
# Get G matrix. G <- G.matrix(M = geno.apple, method = "VanRaden")$G G[1:5, 1:5] # Plot a subset of the individuals. kinship.heatmap(K = G[1:10, 1:10], dendrogram = TRUE, row.label = TRUE, col.label = TRUE)
Generates a PCA and summary statistics from a given kinship matrix for
population structure. This matrix
can be a pedigree-based relationship matrix , a genomic
relationship matrix
or a hybrid relationship matrix
. Individual names should be assigned to
rownames
and
colnames
. There is additional output such as plots and other data frames
to be used on other downstream analyses (such as GWAS).
kinship.pca( K = NULL, scale = TRUE, label = FALSE, ncp = 10, groups = NULL, ellipses = FALSE )
kinship.pca( K = NULL, scale = TRUE, label = FALSE, ncp = 10, groups = NULL, ellipses = FALSE )
K |
Input of a kinship matrix in full form ( |
scale |
If |
label |
If |
ncp |
The number of PC dimensions to be shown in the screeplot, and to provide
in the output data frame (default = |
groups |
Specifies a vector of class factor that will be used to define different
colors for individuals in the PCA plot. It must be presented in the same order as the individuals
in the kinship matrix (default = |
ellipses |
If |
It calls function eigen()
to obtain eigenvalues and later generate the PCA and the
factoextra
R package to extract and visualize results.
A list with the following four elements:
eigenvalues
: a data frame with the eigenvalues and its variances associated with each dimension
including only the first ncp
dimensions.
pca.scores
: a data frame with scores (rotated observations on the new components) including
only the first ncp
dimensions.
plot.pca
: a scatterplot with the first two-dimensions (PC1 and PC2) and their scores.
plot.scree
: a barchart with the percentage of variances explained by the ncp
dimensions.
# Get G matrix. G <- G.matrix(M = geno.apple, method = "VanRaden")$G G[1:5, 1:5] # Perform the PCA. G_pca <- kinship.pca(K = G, ncp = 10) ls(G_pca) G_pca$eigenvalues head(G_pca$pca.scores) G_pca$plot.pca G_pca$plot.scree # PCA plot by family (17 groups). grp <- as.factor(pheno.apple$Family) G_pca_grp <- kinship.pca(K = G, groups = grp, label = FALSE, ellipses = FALSE) G_pca_grp$plot.pca
# Get G matrix. G <- G.matrix(M = geno.apple, method = "VanRaden")$G G[1:5, 1:5] # Perform the PCA. G_pca <- kinship.pca(K = G, ncp = 10) ls(G_pca) G_pca$eigenvalues head(G_pca$pca.scores) G_pca$plot.pca G_pca$plot.scree # PCA plot by family (17 groups). grp <- as.factor(pheno.apple$Family) G_pca_grp <- kinship.pca(K = G, groups = grp, label = FALSE, ellipses = FALSE) G_pca_grp$plot.pca
Assesses a given genomic relationship matrix against the
pedigree relationship matrix
, or vice versa,
to determine the matched and mismatched individuals.
If requested, it provides the cleaned versions containing only the matched individuals
between both matrices. The user should provide the matrices
and
in full form (
and
, respectively).
Individual names should be assigned to
rownames
and colnames
for both matrices.
match.G2A( A = NULL, G = NULL, clean = TRUE, ord = TRUE, mism = FALSE, RMdiff = FALSE, message = TRUE )
match.G2A( A = NULL, G = NULL, clean = TRUE, ord = TRUE, mism = FALSE, RMdiff = FALSE, message = TRUE )
A |
Input of the pedigree relationship matrix |
G |
Input of the genomic relationship matrix |
clean |
If |
ord |
If |
mism |
If |
RMdiff |
If |
message |
If |
A list with the following elements:
Gclean
: a matrix with the portion of containing only matched individuals.
Aclean
: a matrix with the portion of containing only matched individuals.
mismG
: a vector containing the names of the individuals from matrix that are
missing in matrix
.
mismA
: a vector containing the names of the individuals from matrix that are
missing in matrix
.
RM
: a data frame with the observations from both the and
matched matrices, together with their absolute relationship difference.
plotG2A
: scatterplot with the pairing of matched pedigree- against genomic-based
relationship values. This graph might take a long to plot with large datasets.
# Get A matrix. A <- AGHmatrix::Amatrix(data = ped.pine) A[1:5,1:5] dim(A) # Read and filter genotypic data. M.clean <- qc.filtering( M = geno.pine655, maf = 0.05, marker.callrate = 0.2, ind.callrate = 0.20, na.string = "-9", plots = FALSE)$M.clean # Get G matrix. G <- G.matrix(M = M.clean, method = "VanRaden", na.string = "-9")$G G[1:5, 1:5] dim(G) # Match G2A. check <- match.G2A( A = A, G = G, clean = TRUE, ord = TRUE, mism = TRUE, RMdiff = TRUE) ls(check) dim(check$Aclean) dim(check$Gclean) check$Aclean[1:5, 1:5] check$Gclean[1:5, 1:5] head(check$mismG) head(check$mismA) check$plotG2A head(check$RM)
# Get A matrix. A <- AGHmatrix::Amatrix(data = ped.pine) A[1:5,1:5] dim(A) # Read and filter genotypic data. M.clean <- qc.filtering( M = geno.pine655, maf = 0.05, marker.callrate = 0.2, ind.callrate = 0.20, na.string = "-9", plots = FALSE)$M.clean # Get G matrix. G <- G.matrix(M = M.clean, method = "VanRaden", na.string = "-9")$G G[1:5, 1:5] dim(G) # Match G2A. check <- match.G2A( A = A, G = G, clean = TRUE, ord = TRUE, mism = TRUE, RMdiff = TRUE) ls(check) dim(check$Aclean) dim(check$Gclean) check$Aclean[1:5, 1:5] check$Gclean[1:5, 1:5] head(check$mismG) head(check$mismA) check$plotG2A head(check$RM)
Assesses a given kinship matrix against the provided phenotypic data to determine
if all genotypes are in the kinship matrix or not. It also reports which individuals
match or are missing from one set or another.
If requested, a reduced kinship matrix is generated that has only the matched individuals.
The input kinship matrix can be a pedigree-based relationship matrix ,
a genomic-based relationship matrix
, or a hybrid
relationship matrix
.
Individual names should be assigned to
rownames
and colnames
of input matrix.
match.kinship2pheno( K = NULL, pheno.data = NULL, indiv = NULL, clean = FALSE, ord = TRUE, mism = FALSE, message = TRUE )
match.kinship2pheno( K = NULL, pheno.data = NULL, indiv = NULL, clean = FALSE, ord = TRUE, mism = FALSE, message = TRUE )
K |
Input of a kinship matrix in full form ( |
pheno.data |
A data fame with the phenotypic data to assess (for |
indiv |
The string for the column name for genotypes/individuals in the phenotypic data (default = |
clean |
If |
ord |
If |
mism |
If |
message |
If |
A list with the following elements:
mismatchesK
: a vector containing the names of the individuals from the provided kinship matrix
that mismatch with the phenotypic data.
matchesK
: a vector containing the names of the individuals from the provided kinship matrix
that match with the phenotypic data.
mismatchesP
: a vector containing the names of phenotyped individuals
that mismatch with those from the kinship matrix.
matchesP
: a vector containing the names of phenotyped individuals
that match with those from the kinship matrix.
Kclean
: a clean kinship matrix containing only the matched phenotyped individuals.
## Not run: # Get G matrix. G <- G.matrix(M = geno.pine655, method = "VanRaden", na.string = "-9", sparseform = FALSE)$G dim(G) # Match G and the phenotype. check <- match.kinship2pheno( K = G, pheno.data = pheno.pine, indiv = "Genotype", clean = TRUE, mism = TRUE) ls(check) length(check$matchesK) length(check$mismatchesK) length(check$matchesP) length(check$mismatchesP) dim(check$Kclean) ## End(Not run)
## Not run: # Get G matrix. G <- G.matrix(M = geno.pine655, method = "VanRaden", na.string = "-9", sparseform = FALSE)$G dim(G) # Match G and the phenotype. check <- match.kinship2pheno( K = G, pheno.data = pheno.pine, indiv = "Genotype", clean = TRUE, mism = TRUE) ls(check) length(check$matchesK) length(check$mismatchesK) length(check$matchesP) length(check$mismatchesP) dim(check$Kclean) ## End(Not run)
Individual pedigree data for a total of 2,034 records of loblolly pine (Pinus taeda L.). Missing parental information coded as 0. Dataset obtained from supplementary material in Resende et al. (2012).
ped.pine
ped.pine
data.frame
Resende, M.F.R., Munoz, P. Resende, M.D.V., Garrick, D.J., Fernando, R.L., Davis, J.M., Jokela, E.J., Martin, T.A., Peter, G.F., and Kirst, M. 2012. Accuracy of genomic selection methods in a standard data set of loblolly pine (Pinus taeda L.). Genetics 190:1503-1510.
ped.pine |> head()
ped.pine |> head()
Pedigree data of 1,481 Atlantic salmon samples. Missing parental information coded as 0. Dataset obtained from supplementary material in Robledo et al. (2018).
ped.salmon
ped.salmon
data.frame
Robledo D., Matika O., Hamilton A., and Houston R.D. 2018. Genome-wide association and genomic selection for resistance to amoebic gill disease in Atlantic salmon. G3 Genes, Genomes, Genetics 8:1195-1203.
ped.salmon |> head()
ped.salmon |> head()
Phenotypic data on 247 apple clones (i.e., genotypes) evaluated for several fruit quality traits at two New Zealand sites, Motueka (MOT) and Hawkes Bay (HB). Dataset obtained from supplementary material in Kumar et al. (2015).
pheno.apple
pheno.apple
data.frame
Kumar S., Molloy C., Muñoz P., Daetwyler H., Chagné D., and Volz R. 2015. Genome-enabled estimates of additive and nonadditive genetic variances and prediction of apple phenotypes across environments. G3 Genes, Genomes, Genetics 5:2711–2718.
pheno.apple |> head()
pheno.apple |> head()
Deregressed estimated breeding values (DEBV) for the trait diameter at breast height (DBH) at 6 years of age from trees grown at site Nassau. The dataset contains a total of 861 genotypes of loblolly pine (Pinus taeda L.). Dataset obtained from supplementary material in Resende et al. (2012).
pheno.pine
pheno.pine
data.frame
Resende, M.F.R., Munoz, P. Resende, M.D.V., Garrick, D.J., Fernando, R.L., Davis, J.M., Jokela, E.J., Martin, T.A., Peter, G.F., and Kirst, M. 2012. Accuracy of genomic selection methods in a standard data set of loblolly pine (Pinus taeda L.). Genetics 190:1503-1510.
pheno.pine |> head()
pheno.pine |> head()
Phenotypic data on 1,481 Atlantic salmon individuals. All fish were phenotyped for mean gill score (mean of the left gill and right gill scores) and amoebic load (qPCR values using Neoparamoeba perurans specific primers, amplified from one of the gills). Dataset obtained from supplementary material in Robledo et al. (2018).
pheno.salmon
pheno.salmon
data.frame
Robledo D., Matika O., Hamilton A., and Houston R.D. 2018. Genome-wide association and genomic selection for resistance to amoebic gill disease in Atlantic salmon. G3 Genes, Genomes, Genetics 8:1195-1203.
pheno.salmon |> head()
pheno.salmon |> head()
Reads molecular data in the format 0, 1, 2 and performs some basic quality control
filters and simple imputation.
Matrix provided is of the full form (), with
individuals and
markers.
Individual and marker names are assigned to
rownames
and colnames
,
respectively. Filtering can be done with the some of the following options by
specifying thresholds for:
missing values on individuals, missing values on markers, minor allele frequency,
inbreeding Fis value (of markers), and observed heterozygosity (of markers).
String used for identifying missing values can be specified.
If requested, missing values will be imputed based on the mean of each SNP.
qc.filtering( M = NULL, base = FALSE, na.string = NA, map = NULL, marker = NULL, chrom = NULL, pos = NULL, ref = NULL, marker.callrate = 1, ind.callrate = 1, maf = 0, heterozygosity = 1, Fis = 1, impute = FALSE, Mrecode = FALSE, plots = TRUE, digits = 2, message = TRUE )
qc.filtering( M = NULL, base = FALSE, na.string = NA, map = NULL, marker = NULL, chrom = NULL, pos = NULL, ref = NULL, marker.callrate = 1, ind.callrate = 1, maf = 0, heterozygosity = 1, Fis = 1, impute = FALSE, Mrecode = FALSE, plots = TRUE, digits = 2, message = TRUE )
M |
A matrix with SNP data of full form ( |
base |
If |
na.string |
A character that will be interpreted as |
map |
(Optional) A data frame with the map information with |
marker |
A character indicating the name of the column in data frame |
chrom |
A character indicating the name of the column in data frame |
pos |
A character indicating the name of the column in data frame |
ref |
A character indicating the name of the column in the map containing the reference allele for
recoding. If absent, then conversion will be based on the major allele (most frequent).
The marker information of a given individuals with two of the specified major alleles
in |
marker.callrate |
A numerical value between 0 and 1 used to remove SNPs with a rate of missing values equal or larger than this value (default = 1, i.e. no removing). |
ind.callrate |
A numerical value between 0 and 1 used to remove individuals with a rate of missing values equal or larger than this value (default = 1, i.e. no removing). |
maf |
A numerical value between 0 and 1 used to remove SNPs with a Minor Allele Frequency (MAF) below this value (default = 0, i.e. no removing). |
heterozygosity |
A numeric value indicating the maximum value of accepted observed heterozygosity (Ho) (default = 1, i.e. no removing). |
Fis |
A numeric value indicating the maximum value of accepted inbreeding (Fis) following
the equation |
impute |
If |
Mrecode |
If |
plots |
If |
digits |
Set up the number of digits used to round the output matrix (default = 2). |
message |
If |
Warning: The arguments base
, ref
, and Mrecode
currently are deprecated and will
be removed on the next version of ASRgenomics
.
Use function snp.recode to recode the matrix prior to using qc.filtering
.
The filtering process is carried out as expressed in the following simplified pseudo-code that consists on a loop repeated twice:
for i in 1 to 2
Filter markers based on call rate.
Filter individuals based on call rate.
Filter markers based on minor allele frequency.
Filter markers based on observed heterozygosity.
Filter markers based on inbreeding.
end for
A list with the following elements:
M.clean
: the cleaned matrix after the quality control filters have been applied.
map
: if provided, a cleaned map
data frame after the quality control filters have been applied.
plot.missing.ind
: a plot of missing data per individual (original marker matrix).
plot.missing.SNP
: a plot of missing data per SNP (original marker matrix).
plot.heteroz
: a plot of observed heterozygocity per SNP (original marker matrix).
plot.Fis
: a plot of Fis per SNP (original marker matrix).
plot.maf
: a plot of the minor allele frequency (original marker matrix).
# Example: Pine dataset from ASRgenomics (coded as 0,1,2 with missing as -9). M.clean <- qc.filtering( M = geno.pine926, maf = 0.05, marker.callrate = 0.9, ind.callrate = 0.9, heterozygosity = 0.9, Fis = 0.6, na.string = "-9") ls(M.clean) M.clean$M.clean[1:5, 1:5] dim(M.clean$M.clean) head(M.clean$map) M.clean$plot.maf M.clean$plot.missing.ind M.clean$plot.missing.SNP M.clean$plot.heteroz M.clean$plot.Fis # Example: Salmon dataset (coded as 0,1,2 with missing as NA). M.clean <- qc.filtering( M = geno.salmon, maf = 0.02, marker.callrate = 0.10, ind.callrate = 0.20, heterozygosity = 0.9, Fis = 0.4) M.clean$M.clean[1:5, 1:5] dim(M.clean$M.clean) head(M.clean$map) M.clean$plot.maf M.clean$plot.missing.ind M.clean$plot.missing.SNP M.clean$plot.heteroz M.clean$plot.Fis
# Example: Pine dataset from ASRgenomics (coded as 0,1,2 with missing as -9). M.clean <- qc.filtering( M = geno.pine926, maf = 0.05, marker.callrate = 0.9, ind.callrate = 0.9, heterozygosity = 0.9, Fis = 0.6, na.string = "-9") ls(M.clean) M.clean$M.clean[1:5, 1:5] dim(M.clean$M.clean) head(M.clean$map) M.clean$plot.maf M.clean$plot.missing.ind M.clean$plot.missing.SNP M.clean$plot.heteroz M.clean$plot.Fis # Example: Salmon dataset (coded as 0,1,2 with missing as NA). M.clean <- qc.filtering( M = geno.salmon, maf = 0.02, marker.callrate = 0.10, ind.callrate = 0.20, heterozygosity = 0.9, Fis = 0.4) M.clean$M.clean[1:5, 1:5] dim(M.clean$M.clean) head(M.clean$map) M.clean$plot.maf M.clean$plot.missing.ind M.clean$plot.missing.SNP M.clean$plot.heteroz M.clean$plot.Fis
Generates a PCA and summary statistics from a given molecular matrix
for population structure. Matrix
provided is of full form (), with n individuals and p markers. Individual and
marker names are assigned to
rownames
and colnames
, respectively.
SNP data is coded as 0, 1, 2 (integers or decimal numbers). Missing values are
not accepted and these need to be imputed (see function qc.filtering()
for implementing mean imputation). There is additional output such as plots and
other data frames
to be used on other downstream analyses (such as GWAS).
snp.pca(M = NULL, label = FALSE, ncp = 10, groups = NULL, ellipses = FALSE)
snp.pca(M = NULL, label = FALSE, ncp = 10, groups = NULL, ellipses = FALSE)
M |
A matrix with SNP data of full form ( |
label |
If |
ncp |
The number of PC dimensions to be shown in the screeplot, and to provide
in the output data frame (default = |
groups |
Specifies a vector of class factor that will be used to define different
colors for individuals in the PCA plot. It must be presented in the same order as the individuals
in the molecular |
ellipses |
If |
It calls function prcomp()
to generate the PCA and the
factoextra
R package to extract and visualize results.
Methodology uses normalized allele frequencies as proposed by Patterson et al. (2006).
A list with the following four elements:
eigenvalues
: a data frame with the eigenvalues and its variances associated with each dimension
including only the first ncp
dimensions.
pca.scores
: a data frame with scores (rotated observations on the new components) including
only the first ncp
dimensions.
plot.pca
: a scatterplot with the first two-dimensions (PC1 and PC2) and their scores.
plot.scree
: a barchart with the percentage of variances explained by the ncp
dimensions.
Patterson N., Price A.L., and Reich, D. 2006. Population structure and eigenanalysis. PLoS Genet 2(12):e190. doi:10.1371/journal.pgen.0020190
# Perform the PCA. SNP_pca <- snp.pca(M = geno.apple, ncp = 10) ls(SNP_pca) SNP_pca$eigenvalues head(SNP_pca$pca.scores) SNP_pca$plot.pca SNP_pca$plot.scree # PCA plot by family (17 groups). grp <- as.factor(pheno.apple$Family) SNP_pca_grp <- snp.pca(M = geno.apple, groups = grp, label = FALSE) SNP_pca_grp$plot.pca
# Perform the PCA. SNP_pca <- snp.pca(M = geno.apple, ncp = 10) ls(SNP_pca) SNP_pca$eigenvalues head(SNP_pca$pca.scores) SNP_pca$plot.pca SNP_pca$plot.scree # PCA plot by family (17 groups). grp <- as.factor(pheno.apple$Family) SNP_pca_grp <- snp.pca(M = geno.apple, groups = grp, label = FALSE) SNP_pca_grp$plot.pca
For a given molecular dataset (in the format 0, 1 and 2)
it produces a reduced molecular matrix by eliminating "redundant"
markers using pruning techniques. This function finds and drops some of the
SNPs in high linkage disequilibrium (LD).
snp.pruning( M = NULL, map = NULL, marker = NULL, chrom = NULL, pos = NULL, method = c("correlation"), criteria = c("callrate", "maf"), pruning.thr = 0.95, by.chrom = FALSE, window.n = 50, overlap.n = 5, iterations = 10, seed = NULL, message = TRUE )
snp.pruning( M = NULL, map = NULL, marker = NULL, chrom = NULL, pos = NULL, method = c("correlation"), criteria = c("callrate", "maf"), pruning.thr = 0.95, by.chrom = FALSE, window.n = 50, overlap.n = 5, iterations = 10, seed = NULL, message = TRUE )
M |
A matrix with marker data of full form ( |
map |
(Optional) A data frame with the map information with |
marker |
A character indicating the name of the column in data frame |
chrom |
A character indicating the name of the column in data frame |
pos |
A character indicating the name of the column in data frame |
method |
A character indicating the method (or algorithm) to be used as reference for
identifying redundant markers.
The only method currently available is based on correlations (default = |
criteria |
A character indicating the criteria to choose which marker to drop
from a detected redundant pair.
Options are: |
pruning.thr |
A threshold value to identify redundant markers with Pearson's correlation larger than the
value provided (default = |
by.chrom |
If TRUE the pruning is performed independently by chromosome (default = |
window.n |
A numeric value with number of markers to consider in each
window to perform pruning (default = |
overlap.n |
A numeric value with number of markers to overlap between consecutive windows
(default = |
iterations |
An integer indicating the number of sequential times the pruning procedure
should be executed on remaining markers.
If no markers are dropped in a given iteration/run, the algorithm will stop (default = |
seed |
An integer to be used as seed for reproducibility. In case the criteria has the
same values for a given pair of markers, one will be dropped at random (default = |
message |
If |
Pruning is recommended as redundancies can affect the quality of matrices used for downstream analyses. The algorithm used is based on the Pearson's correlation between markers as a proxy for LD. In the event of a pairwise correlation higher than the selected threshold markers will be eliminated as specified by: call rate, minor allele frequency. In case of tie, one marker will be dropped at random.
Filtering markers (qc.filtering) is of high relevance before pruning. Poor quality markers (e.g., monomorphic markers) may prevent correlations from being calculated and may affect eliminations.
Mpruned
: a matrix containing the pruned marker M matrix.
map
: an data frame containing the pruned map.
# Read and filter genotypic data. M.clean <- qc.filtering( M = geno.pine655, maf = 0.05, marker.callrate = 0.20, ind.callrate = 0.20, Fis = 1, heterozygosity = 0.98, na.string = "-9", plots = FALSE)$M.clean # Prune correlations > 0.9. Mpr <- snp.pruning( M = M.clean, pruning.thr = 0.90, by.chrom = FALSE, window.n = 40, overlap.n = 10) head(Mpr$map) Mpr$Mpruned[1:5, 1:5]
# Read and filter genotypic data. M.clean <- qc.filtering( M = geno.pine655, maf = 0.05, marker.callrate = 0.20, ind.callrate = 0.20, Fis = 1, heterozygosity = 0.98, na.string = "-9", plots = FALSE)$M.clean # Prune correlations > 0.9. Mpr <- snp.pruning( M = M.clean, pruning.thr = 0.90, by.chrom = FALSE, window.n = 40, overlap.n = 10) head(Mpr$map) Mpr$Mpruned[1:5, 1:5]
Reads molecular data in format of bi-allelic nucleotide bases (AA,
AG, GG, CC, etc.) and recodes them as 0, 1, 2 and NA
to be used in other
downstream analyses.
snp.recode( M = NULL, map = NULL, marker = NULL, ref = NULL, alt = NULL, recoding = c("ATGCto012"), na.string = NA, rename.markers = TRUE, message = TRUE )
snp.recode( M = NULL, map = NULL, marker = NULL, ref = NULL, alt = NULL, recoding = c("ATGCto012"), na.string = NA, rename.markers = TRUE, message = TRUE )
M |
A character matrix with SNP data of full form ( |
map |
(Optional) A data frame with the map information with |
marker |
A character indicating the name of the column in data frame |
ref |
A character indicating the name of the column in the map containing the reference allele for
recoding. If absent, then conversion will be based on the major allele (most frequent).
The marker information of a given individual with two of the specified major alleles
in |
alt |
A character indicating the name of the column in the map containing the alternative allele for
recoding. If absent, then it will be inferred from the data. The marker information of a given individual
with two of the specified alleles in |
recoding |
A character indicating the recoding option to be performed.
Currently, only the nucleotide bases (AA, AG, ...) to allele count is available ( |
na.string |
A character that is interpreted as missing values (default = |
rename.markers |
If |
message |
If |
A list with the following two elements:
Mrecode
: the molecular matrix recoded to 0, 1, 2 and
NA
.
mapr
: the data frame with the map information including reference allele and alternative allele.
# Create bi-allelic base data set. Mnb <- matrix(c( "A-", NA, "GG", "CC", "AT", "CC", "AA", "AA", "AAA", NA, "GG", "AC", "AT", "CG", "AA", "AT", "AA", NA, "GG", "CC", "AA", "CG", "AA", "AA", "AA", NA, "GG", "AA", "AA", NA, "AA", "AA", "AT", NA, "GG", "AA", "TT", "CC", "AT", "TT", "AA", NA, NA, "CC", NA, "GG", "AA", "AA", "AA", NA, NA, "CC", "TT", "CC", "AA", "AT", "TT", NA, "GG", "AA", "AA", "CC", "AA", "AA"), ncol = 8, byrow = TRUE, dimnames = list(paste0("ind", 1:8), paste0("m", 1:8))) Mnb # Recode without map (but map is created). Mr <- snp.recode(M = Mnb, na.string = NA) Mr$Mrecode Mr$map # Create map. mapnb <- data.frame( marker = paste0("m", 1:8), reference = c("A", "T", "G", "C", "T", "C", "A", "T"), alternative = c("T", "G", "T", "A", "A", "G", "T", "A") ) mapnb # Recode with map without alternative allele. Mr <- snp.recode(M = Mnb, map = mapnb, marker = "marker", ref = "reference", na.string = NA, rename.markers = TRUE) Mr$Mrecode Mr$map # Notice that the alternative allele is in the map as a regular variable, # but in the names it is inferred from data (which might be 0 (missing)). # Recode with map with alternative allele. Mr <- snp.recode(M = Mnb, map = mapnb, marker = "marker", ref = "reference", alt = "alternative", na.string = NA, rename.markers = TRUE) Mr$Mrecode Mr$map # Now the alternative is also on the names. # We can also recode without renaming the markers. Mr <- snp.recode(M = Mnb, map = mapnb, marker = "marker", ref = "reference", na.string = NA, rename.markers = FALSE) Mr$Mrecode Mr$map # Now the alternative is also on the names.
# Create bi-allelic base data set. Mnb <- matrix(c( "A-", NA, "GG", "CC", "AT", "CC", "AA", "AA", "AAA", NA, "GG", "AC", "AT", "CG", "AA", "AT", "AA", NA, "GG", "CC", "AA", "CG", "AA", "AA", "AA", NA, "GG", "AA", "AA", NA, "AA", "AA", "AT", NA, "GG", "AA", "TT", "CC", "AT", "TT", "AA", NA, NA, "CC", NA, "GG", "AA", "AA", "AA", NA, NA, "CC", "TT", "CC", "AA", "AT", "TT", NA, "GG", "AA", "AA", "CC", "AA", "AA"), ncol = 8, byrow = TRUE, dimnames = list(paste0("ind", 1:8), paste0("m", 1:8))) Mnb # Recode without map (but map is created). Mr <- snp.recode(M = Mnb, na.string = NA) Mr$Mrecode Mr$map # Create map. mapnb <- data.frame( marker = paste0("m", 1:8), reference = c("A", "T", "G", "C", "T", "C", "A", "T"), alternative = c("T", "G", "T", "A", "A", "G", "T", "A") ) mapnb # Recode with map without alternative allele. Mr <- snp.recode(M = Mnb, map = mapnb, marker = "marker", ref = "reference", na.string = NA, rename.markers = TRUE) Mr$Mrecode Mr$map # Notice that the alternative allele is in the map as a regular variable, # but in the names it is inferred from data (which might be 0 (missing)). # Recode with map with alternative allele. Mr <- snp.recode(M = Mnb, map = mapnb, marker = "marker", ref = "reference", alt = "alternative", na.string = NA, rename.markers = TRUE) Mr$Mrecode Mr$map # Now the alternative is also on the names. # We can also recode without renaming the markers. Mr <- snp.recode(M = Mnb, map = mapnb, marker = "marker", ref = "reference", na.string = NA, rename.markers = FALSE) Mr$Mrecode Mr$map # Now the alternative is also on the names.
Modifies the input sparse form matrix into its full form.
The sparse form has three columns per line, corresponding to the set:
Row, Col, Value
, and is defined by a lower triangle row-wise
of the full matrix and is sorted as columns within row.
Individual names should be assigned as attributes: attr(K, "rowNames")
and attr(K, "colNames")
. If these are not provided they are considered
as 1 to .
sparse2full(K = NULL)
sparse2full(K = NULL)
K |
A square matrix in sparse form (default = |
Based on a function from ASReml-R 3 library by Butler et al. (2009).
A full square matrix where individual names are assigned to
rownames
and colnames
.
If attribute INVERSE
is found this is also passed to the full matrix.
Butler, D.G., Cullis, B.R., Gilmour, A.R., and Gogel, B.J. 2009. ASReml-R reference manual. Version 3. The Department of Primary Industries and Fisheries (DPI&F).
# Get G matrix. Gsp <- G.matrix(M = geno.apple, method = "VanRaden", sparseform = TRUE)$G.sparse head(Gsp) head(attr(Gsp, "rowNames")) # Transform into full matrix. G <- sparse2full(K = Gsp) G[1:5, 1:5]
# Get G matrix. Gsp <- G.matrix(M = geno.apple, method = "VanRaden", sparseform = TRUE)$G.sparse head(Gsp) head(attr(Gsp, "rowNames")) # Transform into full matrix. G <- sparse2full(K = Gsp) G[1:5, 1:5]
This function generates (or imputes) a molecular matrix for offspring from hypothetical crosses based on the genomic information from the parents. This is a common procedure in species such as maize, where only the parents (inbred lines) are genotyped, and this information is used to generate/impute the genotypic data of each of the hybrid offspring. This function can be also used for bulked DNA analyses, in order to obtain an bulked molecular matrix for full-sib individuals were only parents are genotyped.
synthetic.cross( M = NULL, ped = NULL, indiv = NULL, mother = NULL, father = NULL, heterozygote.action = c("useNA", "exact", "fail", "expected"), na.action = c("useNA", "expected"), message = TRUE )
synthetic.cross( M = NULL, ped = NULL, indiv = NULL, mother = NULL, father = NULL, heterozygote.action = c("useNA", "exact", "fail", "expected"), na.action = c("useNA", "expected"), message = TRUE )
M |
A matrix with marker data of full form ( |
ped |
A data frame with three columns containing only the pedigree of the hypothetical offspring.
(not pedigree of parents)
It should include the three columns for individual, mother and father (default = |
indiv |
A character indicating the column in |
mother |
A character indicating the column in |
father |
A character indicating the column in |
heterozygote.action |
Indicates the action to take when heterozygotes are found in a marker.
Options are: |
na.action |
Indicates the action to take when missing values are found in a marker.
Options are: |
message |
If |
For double-haploids, almost the totality of the markers (except for genotyping errors)
will be homozygotic reads. But in many other cases (including recombinant inbred lines)
there will be a proportion of heterozygotic reads. In these case, it is
very difficult to infer (impute) the exact genotype of a given offspring individual.
For example, if parents are 0 (AA) and 1 (AC) then offsprings will differ given this
Mendelian sampling. However, different strategies exist to determine the
expected value for that specific cross (if required), which are detailed below
using the option heterozygote.action
.
If heterozygote.action = "useNA"
,
the generated offspring will have, for the heterozygote read, an NA
,
and no markers are removed.
Hence, no attempt will be done to impute/estimate this value.
If heterozygote.action = "exact"
,
any marker containing one or more heterozygote reads will be removed.
Hence, inconsistent markers are fully removed from the matrix.
If heterozygote.action = "fail"
,
function stops and informs of the presence of heterozygote reads.
If heterozygote.action = "expected"
,
then an algorithm is implemented, on the heterozygote read to determine its
expected value. For example, if parents are 0 and 1, then the expected value
(with equal probability) is 0.5. For a cross between two heterozygotes,
the expected value is: . And for a cross
between 1 and 2, the expected value is:
Missing value require special treatment, and an imputation strategy is detailed
below as indicated using the option na.action
.
If na.action = "useNA"
, if at least one of the parental reads
is missing values for a given marker then it will be assigned as missing for
the hypothetical cross. Hence, no attempt will be done to impute/estimate
this value.
If na.action = "expected"
, then an algorithm is implemented that
will impute the expected read of the cross if the genotype of one of
the parents is missing (e.g., cross between 0 and NA). Calculations
are based on parental allelic frequencies and
for the given
marker. The expressions for expected values are detailed below.
If the genotype of the non-missing parent read is 0.
(probability that the missing parent is 0) x 0 (expected value of the offspring from a 0 x 0 cross:
) +
(probability that the missing parent is 1) x 0.5 (expected offspring from a 0 x 1 cross:
) +
(probability that the missing parent is 2) x 1 (expected offspring from a 0 x 2 cross:
)
If the genotype of the non-missing parent read is 1.
(probability that the missing parent is 0) x 0.5 (offspring:
) +
(probability that the missing parent is 1) x 1 (offspring:
) +
(probability that the missing parent is 2) x 1.5 (offspring:
)
If the genotype of the non-missing parent read is 2.
(probability that the missing parent is 0) x 1 (offspring:
) +
(probability that the missing parent is 1) x 1.5 (offspring:
) +
(probability that the missing parent is 2) x 2 (offspring:
)
Similarly, the calculation of the expected read of a cross when both parents are missing is also based on population allelic frequencies for the given marker. The expressions for expected values are detailed below.
(probability that both parents are 0) x 0 (expected value of the offspring from a 0 x 0 cross: 0(1/1)) +
(probability that the first parent is 0 and the second is 1; this requires
the multiplication by 2 because it is also possible that the first parent is 1 and the second is 0)
x 0.5 (offspring:
) +
(this could be 0 x 2 or 2 x 0) x 1 (offspring:
) +
(both parents are 1) x 1 (offspring:
) +
(this could be 1 x 2 or 2 x 1) x 1.5 (offspring:
) +
(both parents are 2) x 2 (offspring:
)
Note that the use of na.action = "expected"
is recommended when
a large number of offspring will conform the hybrid cross (such as
with bulked DNA analyses) for family groups with reasonable number of individuals.
Warning. If "expected"
is used for heterozygote.action
or na.action
,
direct transformation of the molecular data to other codings (e.g.,
dominance matrix coded as c(0,1,0)
) is not recommended.
A molecular matrix containing the genotypes generated/imputed for the
hypothetical cross.
# Create dummy pedigree (using first 10 as parents). ped <- data.frame( male = rownames(geno.apple)[1:5], female = rownames(geno.apple)[6:10]) ped$offs <- paste(ped$male, ped$female, sep = "_") ped # Select portion of M for parents. Mp <- geno.apple[c(ped$male, ped$female), 1:15] # Get genotype of crosses removing markers with heterozygotes. synthetic.cross( M = Mp, ped = ped, indiv = "offs", mother = "female", father = "male", heterozygote.action = "exact", na.action = "useNA") # Request the synthetic cross to be NA in the respective samples. synthetic.cross( M = Mp, ped = ped, indiv = "offs", mother = "female", father = "male", heterozygote.action = "useNA", na.action = "useNA") # Get genotype of crosses and use expected values. synthetic.cross( M = Mp, ped = ped, indiv = "offs", mother = "female", father = "male", heterozygote.action = "expected", na.action = "expected")
# Create dummy pedigree (using first 10 as parents). ped <- data.frame( male = rownames(geno.apple)[1:5], female = rownames(geno.apple)[6:10]) ped$offs <- paste(ped$male, ped$female, sep = "_") ped # Select portion of M for parents. Mp <- geno.apple[c(ped$male, ped$female), 1:15] # Get genotype of crosses removing markers with heterozygotes. synthetic.cross( M = Mp, ped = ped, indiv = "offs", mother = "female", father = "male", heterozygote.action = "exact", na.action = "useNA") # Request the synthetic cross to be NA in the respective samples. synthetic.cross( M = Mp, ped = ped, indiv = "offs", mother = "female", father = "male", heterozygote.action = "useNA", na.action = "useNA") # Get genotype of crosses and use expected values. synthetic.cross( M = Mp, ped = ped, indiv = "offs", mother = "female", father = "male", heterozygote.action = "expected", na.action = "expected")