Title: | Variant Quality Investigation Helper |
---|---|
Description: | Imports a Variant Call Format (VCF) file into R and detects whether a sample contains contamination from another sample of the same species. In the first stage, a change-point detection method identifies copy number variations for filtering. Next, features are extracted from the data for a support vector machine model. For the log-likelihood calculation, the deviation parameter is estimated by maximum likelihood. A support vector machine with a radial basis function kernel then detects whether the sample is contaminated. |
Authors: | Tao Jiang [aut, cre] |
Maintainer: | Tao Jiang <[email protected]> |
License: | GPL-2 |
Version: | 1.0.0 |
Built: | 2024-10-27 03:48:51 UTC |
Source: | https://github.com/cran/vanquish |
A dataframe containing default parameters.
config_df
A data frame with 12 variables:
threshold
Threshold for allele frequency
skew
Skewness for allele frequency
lower
Lower bound for allele frequency region
upper
Upper bound for allele frequency region
ldpthred
Threshold to determine low depth
hom_mle
Hom MLE of p in Beta-Binomial model
het_mle
Het MLE of p in Beta-Binomial model
Hom_thred
Threshold between hom and high
High_thred
Threshold between high and het
Het_thred
Threshold between het and low
hom_rho
Hom MLE of rho in Beta-Binomial model
het_rho
Het MLE of rho in Beta-Binomial model
Created by Tao Jiang
Detects whether a sample is contaminated by another sample of the same species. The input file should be in VCF format.
defcon(file, rmCNV = FALSE, cnvobj = NULL, config = NULL, class_model = NULL, regression_model = NULL)
file |
VCF input object |
rmCNV |
Remove CNV regions, default is FALSE |
cnvobj |
CNV object, default is NULL |
config |
config information of parameters. A default set is generated as part of the model and is included in a model object, which contains |
class_model |
An SVM classification model |
regression_model |
An SVM regression model |
A list containing (1) stat: a data frame with all statistics for contamination estimation; (2) result: contamination estimation (Class = 0, pure; Class = 1, contaminated)
data(vcf_example)
result <- defcon(file = vcf_example)
Generates features from each pair of input VCF objects for training a contamination detection model.
generate_feature(file, hom_p = 0.999, het_p = 0.5, hom_rho = 0.005, het_rho = 0.1, mixture, homcut = 0.99, highcut = 0.7, hetcut = 0.3)
file |
VCF input object |
hom_p |
The initial value for p in Homozygous Beta-Binomial model, default is 0.999 |
het_p |
The initial value for p in Heterozygous Beta-Binomial model, default is 0.5 |
hom_rho |
The initial value for rho in Homozygous Beta-Binomial model, default is 0.005 |
het_rho |
The initial value for rho in Heterozygous Beta-Binomial model, default is 0.1 |
mixture |
A vector of whether the sample is contaminated: 0 for pure; 1 for contaminated |
homcut |
Cutoff allele frequency value between hom and high, default is 0.99 |
highcut |
Cutoff allele frequency value between high and het, default is 0.7 |
hetcut |
Cutoff allele frequency value between het and low, default is 0.3 |
A data frame with all features for training model of contamination detection
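To illustrate how the three cutoffs partition allele frequencies into zones, here is a minimal base R sketch. The helper name `af_zone` and the treatment of values that fall exactly on a cutoff are assumptions made for illustration; the package's internal rules may differ.

```r
# Hypothetical helper (not part of vanquish): label each allele frequency
# by zone, using the default cutoffs documented for generate_feature().
af_zone <- function(af, homcut = 0.99, highcut = 0.7, hetcut = 0.3) {
  # Assumption: a value equal to a cutoff is assigned to the upper zone.
  cut(af,
      breaks = c(-Inf, hetcut, highcut, homcut, Inf),
      labels = c("low", "het", "high", "hom"),
      right = FALSE)
}

af <- c(0.05, 0.30, 0.48, 0.85, 0.99, 1.00)
zones <- af_zone(af)
table(zones)
```

Counts per zone such as these are one plausible kind of input feature; the actual feature set is whatever `generate_feature()` returns.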
Second alternative allele percentage
getAlt2(f)
f |
Input raw file |
Percent of the second alternative allele
Annotation rate
getAnnoRate(f)
f |
Input raw file |
Percentage of annotated loci
Calculate average log-likelihood
getAvgLL(df, hom_mle, het_mle, hom_rho, het_rho)
df |
Input modified file |
hom_mle |
Hom MLE of p in Beta-Binomial model, default is 0.9981416 from NA12878_1_L5 |
het_mle |
Het MLE of p in Beta-Binomial model, default is 0.4737897 from NA12878_1_L5 |
hom_rho |
Hom MLE of rho in Beta-Binomial model, default is 0.04570275 from NA12878_1_L5 |
het_rho |
Het MLE of rho in Beta-Binomial model, default is 0.02224098 from NA12878_1_L5 |
meanLL
Low depth percentage
getLowDepth(f, ldpthred)
f |
Input raw file |
ldpthred |
Threshold to determine low depth, default is 20 |
Percentage of low depth
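As a rough illustration of the statistic, the following base R sketch computes a low-depth percentage from a plain numeric vector of depths. The function name `low_depth_pct`, the strict `<` comparison, and the 0 to 100 scale are assumptions; `getLowDepth()` itself works on the raw file object.

```r
# Hypothetical re-implementation for illustration only; `depth` is a numeric
# vector of per-site read depths, not the package's raw VCF object.
low_depth_pct <- function(depth, ldpthred = 20) {
  # Assumption: "low depth" means strictly below the threshold.
  100 * mean(depth < ldpthred)
}

depth <- c(5, 12, 20, 35, 60, 18, 44, 90)
low_depth_pct(depth)  # 3 of 8 sites are below 20 -> 37.5
```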
Get the ratio of allele frequencies within a region
getRatio(subdf, lower, upper)
subdf |
Dataframe with calculated statistics |
lower |
Lower bound for allele frequency region |
upper |
Upper bound for allele frequency region |
Ratio of allele frequencies within a region
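One plausible reading of this statistic is the share of sites whose allele frequency lies between `lower` and `upper`. The sketch below follows that reading; the function name `in_region_ratio`, the inclusive bounds, and the use of a plain numeric vector instead of `subdf` are assumptions.

```r
# Hypothetical helper: fraction of allele frequencies inside [lower, upper].
in_region_ratio <- function(af, lower = 0.45, upper = 0.55) {
  # Assumption: both bounds are inclusive.
  mean(af >= lower & af <= upper)
}

af <- c(0.10, 0.46, 0.50, 0.54, 0.62, 0.98)
in_region_ratio(af)  # 3 of 6 sites fall inside the region -> 0.5
```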
Get absolute value of skewness
getSkewness(subdf)
subdf |
Input dataframe |
Absolute value of skewness
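For reference, a moment-based skewness can be computed in a few lines of base R. This is a sketch under the assumption that the population-moment estimator is acceptable; the package may use a different estimator, and `abs_skewness` is a hypothetical name.

```r
# Hypothetical helper: absolute value of the moment-based sample skewness,
# m3 / m2^(3/2), where m2 and m3 are central moments.
abs_skewness <- function(x) {
  d  <- x - mean(x)
  m2 <- mean(d^2)
  m3 <- mean(d^3)
  abs(m3 / m2^1.5)
}

abs_skewness(c(1, 2, 3))  # symmetric data -> 0
abs_skewness(c(0, 0, 1))  # one-sided tail -> 1 / sqrt(2)
```

Taking the absolute value means a left-skewed and a right-skewed allele frequency distribution of the same shape give the same score.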
SNV percentage
getSNVRate(df)
df |
Input raw file |
Percentage of SNV
Calculate zygosity variable
getVar(df, state, hom_mle, het_mle)
df |
Input modified file |
state |
Zygosity state |
hom_mle |
MLE in hom model |
het_mle |
MLE in het model |
Zygosity variable
Check input filename
locateFile(fn, extension)
fn |
Exact full file name of input file, including directory |
extension |
Expected input file extension: vcf & txt |
Valid directory
Calculates negative log likelihood for beta binomial distribution.
negll(x, size, prob, rho)
x |
Depth of alternative allele |
size |
Total depth |
prob |
Theoretical probability for heterozygous is 0.5, for homozygous is 0.999 |
rho |
Rho parameter of Beta-Binomial distribution of alternative allele |
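The quantity being computed can be written out in base R as follows. This is a sketch, not the package source: it assumes the common mean and overdispersion parameterisation, in which the Beta shape parameters are derived from `prob` and `rho`, and the name `bb_negll` is hypothetical.

```r
# Hypothetical sketch of a beta-binomial negative log-likelihood.
# Assumed parameterisation:
#   alpha = prob * (1 - rho) / rho,  beta = (1 - prob) * (1 - rho) / rho
bb_negll <- function(x, size, prob, rho) {
  alpha <- prob * (1 - rho) / rho
  beta  <- (1 - prob) * (1 - rho) / rho
  # Log probability mass of x alternative reads out of `size` total reads.
  loglik <- lchoose(size, x) +
    lbeta(x + alpha, size - x + beta) - lbeta(alpha, beta)
  -sum(loglik)
}

# Alternative-allele depth and total depth at three heterozygous-looking sites.
alt   <- c(14, 22, 9)
total <- c(30, 40, 25)
bb_negll(alt, total, prob = 0.5, rho = 0.1)
```

Under this parameterisation, `rho` close to 0 approaches a plain binomial, while larger `rho` allows more site-to-site spread in the allele frequency.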
Reads a file in vcf or vcf.gz format and creates a list containing Content, Meta, VCF and file_sample_name
read_vcf(fn, vcffor, dbOnly = FALSE, depCut = FALSE, thred = 20, metaline = 200, extnum = 10, keepall = TRUE, filter = FALSE)
fn |
Input vcf file name |
vcffor |
Input vcf data format: 1) GATK; 2) VarPROWL; 3) VarDict; 4) strelka2 |
dbOnly |
Use dbSNP as filter, default is FALSE |
depCut |
Use a threshold for min depth, default is FALSE |
thred |
Threshold for min depth, default is 20 |
metaline |
Number of header lines to read in (should be large enough to cover all meta lines); these lines are checked for meta-information, default is 200 |
extnum |
The column number to be extracted from vcf, default is 10; 0 for not extracting any column; extnum should be between 10 and total column number |
keepall |
Keep unextracted column in output, default is TRUE |
filter |
Whether to select "PASS" variants for analyses if they contain unfiltered variants, default is FALSE |
A list containing (1) Content: a vector showing what is contained; (2) Meta: a data frame containing meta-information of the file; (3) VCF: a data frame, the main part of VCF file; (4) file_sample_name: the file name and sample name, in case when multiple samples exist in one file, file and sample names might be different
file.name <- system.file("extdata", "example.vcf.gz", package = "vanquish")
example <- read_vcf(fn = file.name, vcffor = "VarPROWL")
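To show what separating meta-information from the main table involves, here is a small base R sketch that parses a few in-memory VCF lines and extracts depths from the sample column (column 10). The sample data, the `GT:AD:DP` layout, and all variable names are assumptions for illustration; `read_vcf()` handles caller-specific formats that this sketch does not.

```r
# Hypothetical minimal reader for illustration; not the read_vcf() implementation.
vcf_lines <- c(
  "##fileformat=VCFv4.2",
  "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1",
  "1\t10177\trs1\tA\tC\t50\tPASS\t.\tGT:AD:DP\t0/1:12,10:22",
  "1\t10352\t.\tT\tG\t60\tPASS\t.\tGT:AD:DP\t1/1:0,30:30"
)

# Meta lines start with "##"; everything else is the header plus records.
is_meta <- startsWith(vcf_lines, "##")
meta    <- vcf_lines[is_meta]
body    <- vcf_lines[!is_meta]

# The first non-meta line is the header; strip its leading "#".
header <- strsplit(sub("^#", "", body[1]), "\t", fixed = TRUE)[[1]]
# Read everything as character so that a REF of "T" is not parsed as TRUE.
vcf <- read.table(text = body[-1], sep = "\t", col.names = header,
                  stringsAsFactors = FALSE, colClasses = "character")

# Split the sample column and derive the alternative allele frequency.
fields <- strsplit(vcf$SAMPLE1, ":", fixed = TRUE)
ad     <- strsplit(vapply(fields, `[`, character(1), 2), ",", fixed = TRUE)
vcf$alt_depth <- as.numeric(vapply(ad, `[`, character(1), 2))
vcf$depth     <- as.numeric(vapply(fields, `[`, character(1), 3))
vcf$af        <- vcf$alt_depth / vcf$depth
vcf[, c("CHROM", "POS", "alt_depth", "depth", "af")]
```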
Read in input vcf data in GATK format for Contamination detection
readGATK(dr, dbOnly, depCut, thred, content, extnum, keepall)
dr |
A valid input object |
dbOnly |
Use dbSNP as filter, default is FALSE, passed from read_vcf |
depCut |
Use a threshold for min depth, default is FALSE |
thred |
Threshold for min depth, default is 20 |
content |
Column names in VCF files |
extnum |
The column number or numbers to be extracted from vcf, default is 10; 0 for not extracting any columns |
keepall |
Keep unextracted column in output, default is TRUE, passed from read_vcf |
Dataframe from VCF file
Read in input vcf data in strelka2 format for Contamination detection
readStrelka(dr, dbOnly, depCut, thred, content, extnum, keepall)
dr |
A valid input object |
dbOnly |
Use dbSNP as filter, default is FALSE, passed from read_vcf |
depCut |
Use a threshold for min depth, default is FALSE |
thred |
Threshold for min depth, default is 20 |
content |
Column names in VCF files |
extnum |
The column number or numbers to be extracted from vcf, default is 10; 0 for not extracting any columns |
keepall |
Keep unextracted column in output, default is TRUE, passed from read_vcf |
Dataframe from VCF file
Read in input vcf data in VarDict format for Contamination detection
readVarDict(dr, dbOnly, depCut, thred, content, extnum, keepall)
dr |
A valid input object |
dbOnly |
Use dbSNP as filter, default is FALSE, passed from read_vcf |
depCut |
Use a threshold for min depth, default is FALSE |
thred |
Threshold for min depth, default is 20 |
content |
Column names in VCF files |
extnum |
The column number to be extracted from vcf, default is 10; 0 for not extracting any column |
keepall |
Keep unextracted column in output, default is TRUE, passed from read_vcf |
Dataframe from VCF file
Read in input vcf data in VarPROWL format
readVarPROWL(dr, dbOnly, depCut, thred, content, extnum, keepall)
dr |
A valid input object |
dbOnly |
Use dbSNP as filter, default is FALSE, passed from read_vcf |
depCut |
Use a threshold for min depth, default is FALSE |
thred |
Threshold for min depth, default is 20 |
content |
Column names in VCF files |
extnum |
The column number or numbers to be extracted from vcf, default is 10; 0 for not extracting any columns |
keepall |
Keep unextracted column in output, default is TRUE, passed from read_vcf |
Dataframe from VCF file
Estimates Rho parameter in beta binomial distribution for alternative allele frequency
rho_est(vl)
vl |
A list of vcf objects from read_vcf function. |
A list containing (1) het_rho: Rho parameter of heterozygous locations; (2) hom_rho: Rho parameter of homozygous locations
data("vcf_example")
vcf_list <- list()
vcf_list[[1]] <- vcf_example$VCF
res <- rho_est(vl = vcf_list)
res$het_rho[[1]]$par
res$hom_rho[[1]]$par
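The idea of estimating rho by maximum likelihood can be sketched on simulated data without the package. The example below is an illustration under stated assumptions: it fixes `prob` at 0.5, uses the mean and overdispersion parameterisation of the beta-binomial, and uses base R `optimize()` as a one-dimensional optimiser, which may differ from what `rho_est()` does internally. All names and simulated values are hypothetical.

```r
set.seed(1)

# Simulate alternative-allele depths at heterozygous sites (hypothetical data).
n_sites  <- 2000
size     <- rep(60, n_sites)   # total depth per site
true_rho <- 0.1
prob     <- 0.5
alpha    <- prob * (1 - true_rho) / true_rho
beta     <- (1 - prob) * (1 - true_rho) / true_rho
x        <- rbinom(n_sites, size, rbeta(n_sites, alpha, beta))

# Negative log-likelihood as a function of rho, with prob held fixed.
negll_rho <- function(rho) {
  a <- prob * (1 - rho) / rho
  b <- (1 - prob) * (1 - rho) / rho
  -sum(lchoose(size, x) + lbeta(x + a, size - x + b) - lbeta(a, b))
}

# One-dimensional minimisation over a bounded interval for rho.
fit <- optimize(negll_rho, interval = c(1e-4, 0.9))
fit$minimum   # estimate of rho; expected to land near true_rho
```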
Remove CNV regions within VCF files by change point method
rmChangePoint(vcf, threshold, skew, lower, upper)
vcf |
Input VCF files |
threshold |
Threshold for allele frequency |
skew |
Skewness for allele frequency |
lower |
Lower bound for allele frequency region |
upper |
Upper bound for allele frequency region |
VCF object without change point region
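This page does not say which change-point algorithm is used, so the following base R sketch swaps in a simple stand-in: a single mean-shift search that minimises the within-segment sum of squares. The function name `find_changepoint` and the toy allele frequencies are assumptions, and the sketch ignores the `threshold`, `skew`, `lower` and `upper` arguments.

```r
# Hypothetical single change-point search (least-squares mean shift).
find_changepoint <- function(x) {
  n <- length(x)
  sse <- vapply(seq_len(n - 1), function(k) {
    left  <- x[1:k]
    right <- x[(k + 1):n]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }, numeric(1))
  which.min(sse)   # index of the last point before the shift
}

# Allele frequencies ordered by position: balanced heterozygous sites,
# followed by a shifted segment such as a copy-number gain might produce.
af <- c(rep(0.50, 50), rep(0.33, 30))
cp <- find_changepoint(af)
cp

# Keep only the sites before the detected shift.
af_kept <- af[seq_len(cp)]
```

In real data the allele frequencies are noisy, so a practical method also needs a rule for deciding whether a detected shift is large enough to act on.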
Remove CNV regions within VCF files given cnv file
rmCNVinVCF(vcf, cnvobj)
vcf |
Input VCF files |
cnvobj |
cnv object |
VCF object without change point region
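Filtering variants against known CNV intervals amounts to an interval lookup per variant. The sketch below shows that step in base R; the column names (`CHROM`, `POS`, `chr`, `start`, `end`) and the inclusive interval ends are assumptions, since the layout of `cnvobj` is not described here.

```r
# Hypothetical inputs: the real cnvobj layout is not documented on this page.
vcf <- data.frame(CHROM = c("1", "1", "1", "2", "2"),
                  POS   = c(100, 5500, 9000, 300, 7200),
                  stringsAsFactors = FALSE)
cnv <- data.frame(chr   = c("1", "2"),
                  start = c(5000, 7000),
                  end   = c(6000, 8000),
                  stringsAsFactors = FALSE)

# Flag a variant when it lies inside any CNV interval on the same chromosome.
in_cnv <- vapply(seq_len(nrow(vcf)), function(i) {
  any(cnv$chr == vcf$CHROM[i] &
      cnv$start <= vcf$POS[i] &
      cnv$end >= vcf$POS[i])
}, logical(1))

# Keep only variants outside every CNV interval.
vcf_clean <- vcf[!in_cnv, ]
vcf_clean
```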
Summarizes allele frequency information in scatter and density plots
summary_vcf(vcf, ZG = NULL, CHR = NULL)
vcf |
VCF object from read_vcf function |
ZG |
zygosity: (1) null, for both het and hom, default; (2) het; (3) hom |
CHR |
chromosome number: (1) null, all chromosome, default; (2) any specific number |
A list containing (1) scatter: allele frequency scatter plot; (2) density: allele frequency density plot
data("vcf_example")
tmp <- summary_vcf(vcf = vcf_example, ZG = 'het', CHR = c(1,2))
plot(tmp$scatter)
plot(tmp$density)
An svm object containing default svm classification model.
svm_class_model
An svm object:
Created by Tao Jiang
An svm object containing default svm regression model.
svm_regression_model
An svm object:
Created by Tao Jiang
Trains two SVM models (classification and regression) to detect whether a sample is contaminated by another sample of the same species.
train_ct(feature)
feature |
Feature list objects from generate_feature() |
A list containing two trained SVM models: regression and classification
Remove CNV regions within VCF files
update_vcf(rmCNV = FALSE, vcf, cnvobj = NULL, threshold = 0.1, skew = 0.5, lower = 0.45, upper = 0.55)
rmCNV |
Remove CNV regions, default is FALSE |
vcf |
Input VCF files |
cnvobj |
cnv object, default is NULL |
threshold |
Threshold for allele frequency, default is 0.1 |
skew |
Skewness for allele frequency, default is 0.5 |
lower |
Lower bound for allele frequency region, default is 0.45 |
upper |
Upper bound for allele frequency region, default is 0.55 |
VCF file without CNV region
An example containing a list of 4 data frames.
vcf_example
A list of 4 data frames:
Created by Tao Jiang