Variant Calling for Herpes Simplex Virus 1 from Patient Sample Using Capture Probe Sequencing

gene_x 0 like s 548 view s

Tags: processing, pipeline

  1. The following data-cleaning strategies were applied before variant calling:

    1. * Note that the intrahost results does not contain interhost variants.
    2. * Using the two methods (freebayes and spandx for reporting interhost variant), using viral-ngs reporting intrahost variants.
    3. * The interhost results is released in the point 4 and the intrahost results released in the step 13.
    4. * Merge the two results, delete the items from intrahost variants if it occurs in the interhost tables.
    5. * A records in intrahost table in which the frequency in an isolate >= 0.5 while in an other isolate < 0.5 should be in interhost table. If both are >= 0.5, or < 0.5 should be not in the interhost table.
    6. * We can roughly check if the correctness of the intrahost variant results with the table from point 18 generated by "~/Scripts/check_sequence_differences.py aligned_1_.aln".
    7. * At the end, we should have a interhost variant calling table + a intrahost varicant calling table in which the frequency varies between 0.05 and 0.5.
    8. * Another control method: merge the two tables and sort according to the coordinate. Then compare the coordinate with results ~/Scripts/check_sequence_differences.py aligned_1_.aln. They should be similar.
    9. * If a record occurs in the table from point 18, not in intrahost+interhost table, meaning the base is wrongly called during the assembly.
    10. * The correction of the assembly in the step data/02_assembly and data/03_multialign_to_ref/aligned_1.fasta is actually not very critical, since if a wrongly called base, the intrahost will be a record >= 0.5 frequency.
    11. * The error earlier: only report the intrahost variant calling, few interhost variant calling. The interhost variant calling will be found if they the bases in the assembly is wrongly called.
    12. * !!!! IMPORTANT: Delete the last position of the alleles if there are three in the alleles in the intrahost Excel-table before releasing !!!!
    13. #Report the interhost+intrahost results to Nicole.
    • Input BAM File
      • The original BAM file (e.g., HSV-Klinik_S2.raw.bam) was used as the initial input.
    • BMTagger Depletion
      • BMTagger was employed to remove reads matching contaminants, using specified databases. The resulting file (e.g., HSV-Klinik_S2.bmtagger_depleted.bam) is expected to have potential contaminants removed.
      • Databases Used:
        • metagenomics_contaminants_v3: Contains sequences commonly found as contaminants in metagenomic samples.
        • GRCh38_ncRNA-GRCh38_transcripts-HS_rRNA_mitRNA: A comprehensive database of human RNA, including non-coding RNA (ncRNA), transcripts, ribosomal RNA (rRNA), and mitochondrial RNA (mitRNA).
        • hg38: A version of the human genome.
    • Duplicate Removal
      • Duplicates were removed, producing a file (e.g., HSV-Klinik_S2.rmdup.bam) with reduced PCR duplicates, which helps decrease bias in downstream analyses.
    • BLASTn Depletion
      • After duplicate removal, BLASTn was used to further refine the file by removing remaining non-target reads. The output (e.g., HSV-Klinik_S2.cleaned.bam) should be free of contaminants and duplicates.
      • Databases Used (blastDbs):
        • hybsel_probe_adapters: Contains sequences for hybrid selection probe adapters and synthetic sequences used in sequencing and library preparation.
        • metag_v3.ncRNA.mRNA.mitRNA.consensus: A curated database of consensus sequences, including non-coding RNA, mRNA, and mitochondrial RNA.
    • Taxonomic Filtering
      • HSV-1-specific sequences were isolated by filtering with a custom database of 161 complete HSV-1 genomes from GenBank (see the end of this email). The tool last was used (documentation: [https://docs.hpc.qmul.ac.uk/apps/bio/last/]), producing the taxfiltBam file (e.g., HSV-Klinik_S2.taxfilt.bam).
    • Assembly with Taxonomically Filtered Reads
    • Precise Mapping
      • Using the aligner novoalign with alignment options -r Random -l 20 -g 40 -x 20 -t 100 -k, I created a file (HSV-Klinik_S2.mapped.bam) containing reads aligned to themselves.
        1. Read Counts for BAM Files:
        2. File Read Count
        3. HSV1_S1.raw.bam 1,816,139 × 2
        4. HSV1_S1.bmtagger_depleted.bam 1,750,387 × 2
        5. HSV1_S1.rmdup.bam 1,278,873 × 2
        6. HSV1_S1.cleaned.bam 664,544 × 2
        7. HSV1_S1.taxfilt.bam 22,841 × 2
        8. HSV1_S1.mapped.bam 131 × 2
        9. HSV-Klinik_S2.raw.bam 2,709,058 × 2
        10. HSV-Klinik_S2.bmtagger_depleted.bam 1,582,923 × 2
        11. HSV-Klinik_S2.rmdup.bam 595,066 × 2
        12. HSV-Klinik_S2.cleaned.bam 442,841 × 2
        13. HSV-Klinik_S2.taxfilt.bam 400,301 × 2
        14. HSV-Klinik_S2.mapped.bam 80,915 × 2
        15. bin/taxon_filter.py deplete \
        16. inBam=data/00_raw/HSV-Klinik_S2.bam \
        17. revertBam=tmp/01_cleaned/HSV-Klinik_S2.raw.bam \
        18. bmtaggerBam=tmp/01_cleaned/HSV-Klinik_S2.bmtagger_depleted.bam \
        19. rmdupBam=tmp/01_cleaned/HSV-Klinik_S2.rmdup.bam \
        20. blastnBam=data/01_cleaned/HSV-Klinik_S2.cleaned.bam \
        21. bmtaggerDbs=['/home/jhuang/REFs/viral_ngs_dbs/bmtagger_dbs_remove/metagenomics_contaminants_v3', \
        22. '/home/jhuang/REFs/viral_ngs_dbs/bmtagger_dbs_remove/GRCh37.68_ncRNA-GRCh37.68_transcripts-HS_rRNA_mitRNA', \
        23. '/home/jhuang/REFs/viral_ngs_dbs/bmtagger_dbs_remove/hg19'] \
        24. blastDbs=['/home/jhuang/REFs/viral_ngs_dbs/blast_dbs_remove/hybsel_probe_adapters', \
        25. '/home/jhuang/REFs/viral_ngs_dbs/blast_dbs_remove/metag_v3.ncRNA.mRNA.mitRNA.consensus'] \
        26. srprism_memory=14250 \
        27. chunkSize=1000000 \
        28. clear_tags=False \
        29. tags_to_clear=['XT', 'X0', 'X1', 'XA', 'AM', 'SM', 'BQ', 'CT', 'XN', 'OC', 'OP'] \
        30. JVMmemory=50g \
        31. threads=120 \
        32. loglevel=INFO \
        33. tmp_dir=/tmp \
        34. tmp_dirKeep=False
        35. inBam Input BAM file.
        36. revertBam Output BAM: read markup reverted with Picard.
        37. bwaBam Output BAM: depleted of reads with BWA.
        38. bmtaggerBam Output BAM: depleted of reads with BMTagger.
        39. rmdupBam Output BAM: bmtaggerBam run through M-Vicuna duplicate removal.
        40. blastnBam Output BAM: rmdupBam run through another depletion of reads with BLASTN.
        41. last876
  2. Using bengal3_ac3 pipeline to get trimmed reads and snippy interhost variants (for virus, it does not work!)

    1. #using the env bengal3_ac3
    2. conda activate bengal3_ac3
    3. mkdir interhost_variants; cd interhost_variants
    4. #prepare scritps
    5. cp /home/jhuang/Tools/bacto/bacto-0.1.json .
    6. cp /home/jhuang/Tools/bacto/cluster.json .
    7. cp /home/jhuang/Tools/bacto/Snakefile .
    8. ln -s /home/jhuang/Tools/bacto/local .
    9. ln -s /home/jhuang/Tools/bacto/db .
    10. ln -s /home/jhuang/Tools/bacto/envs .
    11. #preparing raw_data
    12. mkdir raw_data; cd raw_data
    13. ln -s ~/DATA/Data_Nicole_CaptureProbeSequencing/20241028_FS10003086_74_BTR67801-2217/Alignment_Imported_1/20241029_175539/Fastq/HSV1_S1_L001_R1_001.fastq.gz HSV1_S1_R1.fastq.gz
    14. ln -s ~/DATA/Data_Nicole_CaptureProbeSequencing/20241028_FS10003086_74_BTR67801-2217/Alignment_Imported_1/20241029_175539/Fastq/HSV1_S1_L001_R2_001.fastq.gz HSV1_S1_R2.fastq.gz
    15. ln -s ~/DATA/Data_Nicole_CaptureProbeSequencing/20241028_FS10003086_74_BTR67801-2217/Alignment_Imported_1/20241029_175539/Fastq/HSV-Klinik_S2_L001_R1_001.fastq.gz HSV-Klinik_S2_R1.fastq.gz
    16. ln -s ~/DATA/Data_Nicole_CaptureProbeSequencing/20241028_FS10003086_74_BTR67801-2217/Alignment_Imported_1/20241029_175539/Fastq/HSV-Klinik_S2_L001_R2_001.fastq.gz HSV-Klinik_S2_R2.fastq.gz
    17. #ln -s ~/DATA/Data_Nicole_CaptureProbeSequencing/20241028_FS10003086_74_BTR67801-2217/Alignment_Imported_1/20241029_175539/Fastq/NTC_S3_L001_R1_001.fastq.gz NTC_S3_R1.fastq.gz
    18. #ln -s ~/DATA/Data_Nicole_CaptureProbeSequencing/20241028_FS10003086_74_BTR67801-2217/Alignment_Imported_1/20241029_175539/Fastq/NTC_S3_L001_R2_001.fastq.gz NTC_S3_R2.fastq.gz
    19. #preparing bacto-0.1.json.
    20. "fastqc": false,
    21. "taxonomic_classifier": false,
    22. "assembly": false,
    23. "typing_ariba": false,
    24. "typing_mlst": false,
    25. "pangenome": false,
    26. "variants_calling": true,
    27. "phylogeny_fasttree": true,
    28. "phylogeny_raxml": true,
    29. "recombination": true,
    30. "genus": "Herpesvirus",
    31. "kingdom": "Viruses",
    32. "species": "human herpesvirus 1",
    33. "species": "herpes"
    34. "reference": "db/OP297860.gb",
    35. (bengal3_ac3) /home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
    36. # --DEBUG_1 (Don't need to run the step by changing the configuration!)--
    37. prokka --force --outdir prokka/HSV-Klinik_S2 --cpus 2 --usegenus --genus Herpesvirus --kingdom Viruses --species human herpesvirus 1 --addgenes --addmrna --prefix HSV-Klinik_S2 --locustag HSV-Klinik_S2 shovill/HSV-Klinik_S2/contigs.fa -hmm /media/jhuang/Titisee/GAMOLA2/TIGRfam_db/TIGRFAMs_15.0_HMM.LIB
    38. # - using bakta instead due to the error during the prokka-running (bakta doesn't work due to too huge fasta-file)
    39. bakta --db /mnt/nvme0n1p1/bakta_db shovill/HSV-Klinik_S2/contigs.fa --prefix HSV-Klinik_S2 --output prokka/HSV-Klinik_S2 --force
    40. # ---- running directly freebayes as follows ----
    41. cd data
    42. mkdir 02_align_to_OP297860
    43. ../bin/read_utils.py align_and_fix 01_per_sample/HSV1_S1.cleaned.bam ../refsel_db/refsel.fasta --outBamAll 02_align_to_OP297860/HSV1_S1.bam --outBamFiltered 02_align_to_OP297860/HSV1_S1.mapped.bam --aligner novoalign --aligner_options '-r Random -l 20 -g 40 -x 20 -t 100 -k' --threads 120
    44. ../bin/read_utils.py align_and_fix 01_per_sample/HSV-Klinik_S2.cleaned.bam ../refsel_db/refsel.fasta --outBamAll 02_align_to_OP297860/HSV-Klinik_S2.bam --outBamFiltered 02_align_to_OP297860/HSV-Klinik_S2.mapped.bam --aligner novoalign --aligner_options '-r Random -l 20 -g 40 -x 20 -t 100 -k' --threads 120
    45. b
    46. samtools sort 02_align_to_OP297860/HSV-Klinik_S2.mapped.bam -o HSV-Klinik_S2_reads_aligned_sorted.bam
    47. samtools index HSV-Klinik_S2_reads_aligned_sorted.bam
    48. freebayes -f ../ref_genome/reference.fasta -i HSV-Klinik_S2_reads_aligned_sorted.bam --min-coverage 10 --min-alternate-count 3 --vcf freebayes_interhost_out.vcf
    49. #CHROM POS ID REF ALT QUAL
    50. OP297860.1 8885 . A G 7.37349e-13
    51. OP297860.1 8895 . A G 6.00837e-05
    52. OP297860.1 8956 . A G 339.579
    53. OP297860.1 8991 . ATTGT CCTGC 3188.1
    54. OP297860.1 9616 . C A 4.44801e-14
    55. OP297860.1 12748 . C A 63475.5
    56. #HSV-Klinik_S2-1 13203 A C 0.8466 snp 1.34479 C:1581:1060:1581:1060:1 A:21:15:21:15:1
    57. * OP297860.1 13203 . T C 86820.7
    58. OP297860.1 13755 . G A 107298
    59. OP297860.1 14114 . C A 1.21987e-13
    60. OP297860.1 46861 . T C 710.176
    61. * OP297860.1 47109 . T G 9375.53
    62. OP297860.1 47170 . G T 5942.86
    63. OP297860.1 47182 . G A 6108.66
    64. OP297860.1 47320 . A G 10275.4
    65. OP297860.1 47377 . G T 972.379
    66. OP297860.1 47516 . T C 257.388
    67. OP297860.1 47563 . G A 372.177
    68. OP297860.1 47660 . G A 438.692
    69. OP297860.1 47707 . T C 3252.11
    70. OP297860.1 47722 . A G 5343.39
    71. OP297860.1 48064 . G A 21575.7
    72. OP297860.1 48113 . C T 4284.1
    73. OP297860.1 48129 . T C 1778.66
    74. OP297860.1 48167 . T C 3316.44
    75. OP297860.1 48219 . A C 6892.21
    76. OP297860.1 48398 . C A 5.72805e-16
    77. OP297860.1 53216 . G T 2031
    78. OP297860.1 53298 . A G 465.154
    79. OP297860.1 53423 . C T 5586.37
    80. OP297860.1 54025 . A G 385.75
    81. OP297860.1 54073 . G A 8463.94
    82. OP297860.1 54408 . T G 2923.39
    83. OP297860.1 54568 . G T 1391.08
    84. OP297860.1 54708 . TG GA,TA 840.319
    85. OP297860.1 54769 . G T 1.72979e-14
    86. * OP297860.1 55501 . T C 33158.1
    87. * OP297860.1 55807 . C A 0
    88. OP297860.1 56493 . A G 39336.9
    89. OP297860.1 56867 . C A 7.83521e-14
    90. OP297860.1 57513 . C A 0
    91. OP297860.1 58047 . A T 4.21917e-15
    92. OP297860.1 58054 . C A 0
    93. OP297860.1 58056 . ACCA TCCT 0
    94. OP297860.1 58075 . ACTC GCTT 2947.03
    95. OP297860.1 63377 . C A 0
    96. OP297860.1 63393 . G T 1.39225e-14
    97. OP297860.1 65179 . T C 7903.32
    98. * OP297860.1 65225 . G A 13223.5
    99. * OP297860.1 65402 . C A 1.53811e-13
    100. OP297860.1 65992 . T C 25982.5
    101. OP297860.1 66677 . G A 5.27367e-15
    102. OP297860.1 67131 . C A 225.935
    103. OP297860.1 67336 . G A 8.13698e-15
    104. OP297860.1 94706 . C A 0
    105. OP297860.1 94709 . G T 0
    106. * OP297860.1 94750 . G T 0
    107. OP297860.1 95750 . C A 2.89975e-08
    108. OP297860.1 95990 . C A 0
    109. OP297860.1 96070 . G T 0
    110. OP297860.1 137360 . G T 0
    111. OP297860.1 137373 . C A 0
    112. OP297860.1 137527 . A T 4880.59
    113. OP297860.1 137569 . C T 10142.1
    114. OP297860.1 137602 . C A 19065.3
    115. OP297860.1 137986 . A G 0
    116. OP297860.1 138170 . T C 53588.3
    117. OP297860.1 138343 . C T 7310.38
  3. spandx varicant calling (see http://xgenes.com/article/article-content/314/call-and-merge-snp-and-indel-results-from-using-two-variant-calling-methods/)

    1. mkdir ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/OP297860
    2. #cp OP297860.gb ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/OP297860/genes.gbk
    3. vim ~/miniconda3/envs/spandx/share/snpeff-5.1-2/snpEff.config
    4. /home/jhuang/miniconda3/envs/spandx/bin/snpEff build OP297860 -d
    5. #Protein check: OP297860 OK: 73 Not found: 0 Errors: 0 Error percentage: 0.0%
    6. ## -- try using gffs+fa to install the database, but failed --
    7. #cp OP297860.gff3 ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/OP297860/genes.gff
    8. #cp OP297860.fasta ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/OP297860//sequences.fa
    9. #vim ~/miniconda3/envs/spandx/share/snpeff-5.1-2/snpEff.config
    10. ##OP297860.genome : Herpes_Simplex_Virus_1
    11. #snpEff build -v OP297860 #-gff3
    12. #cd /path/to/snpEff/data
    13. #mkdir NC_001806
    14. #cp NC_001806.gff3 NC_001806/genes.gff
    15. #cp NC_001806.fa NC_001806/sequences.fa
    16. ##NC_001806.genome : Herpes_Simplex_Virus_1
    17. ##bcftools reheader -h new_header.vcf HSV-Klinik_S2.PASS.snps.vcf -o updated_vcf.vcf
    18. ##table_annovar <input.vcf> <humandb> -buildver <genome_version> -out <output_prefix> -protocol <protocol_list> -operation <operation_list>
    19. cd trimmed
    20. mv HSV1_S1_trimmed_P_1.fastq HSV1_S1_R1.fastq
    21. mv HSV1_S1_trimmed_P_2.fastq HSV1_S1_R2.fastq
    22. mv HSV-Klinik_S2_trimmed_P_1.fastq HSV-Klinik_S2_R1.fastq
    23. mv HSV-Klinik_S2_trimmed_P_2.fastq HSV-Klinik_S2_R2.fastq
    24. gzip *_R1.fastq *_R2.fastq
    25. cp ref_genome/reference.fasta OP297860.fasta
    26. #Clean the header to only retain the accession-id "OP297860.1"
    27. ln -s /home/jhuang/Tools/spandx/ spandx
    28. (spandx) nextflow run spandx/main.nf --fastq "trimmed/*_R{1,2}.fastq.gz" --ref OP297860.fasta --annotation --database OP297860 -resume
    29. # -- DEBUG: All_SNPs_indels_annotated.txt is not correctly annotated, manually rerun snpeff-4.1l-8 and related steps --
    30. ## OPTION_1: copy the viral-ngs4 database to the spandx database, failed during the version difference --
    31. #cp /home/jhuang/miniconda3/envs/viral-ngs4/share/snpeff-4.1l-8/data/1158c840951524dbd03a1a055a837d3828f6f29af1ec2771219e77c/genes.gbk .
    32. ##/home/jhuang/miniconda3/envs/spandx/bin/snpEff build OP297860 -d
    33. # OPTION_2: run via interhost.py (SUCCESSFUL!)
    34. #repeat the processing in spandx/bin/SNP_matrix.sh to generate All_SNPs_indels_annotated.txt, the snpEff step using we with 'bin/interhost.py snpEff' in the env viral-ngs4
    35. cd work/f8/93141f3ef382d7ac9dd40def9c50ce (last directory sorted by timestamp)
    36. #gatk VariantsToTable -V out.vcf -F CHROM -F POS -F REF -F ALT -F TYPE -GF GT -O out.vcf.table.all
    37. #
    38. ##clean-up the out.vcf.table.all because GATK outputs A/A
    39. #sed -i 's#|#/#g' out.vcf.table.all
    40. #awk ' { for (i=6; i<=NF; i++) {
    41. # if ($i == "A/A") $i="A";
    42. # if ($i == "G/G") $i="G";
    43. # if ($i == "C/C") $i="C";
    44. # if ($i == "T/T") $i="T";
    45. # if ($i == "*/*") $i="*";
    46. # if ($i == "./.") $i=".";
    47. # }};
    48. # {print $0} ' out.vcf.table.all > out.vcf.table.all.tmp
    49. #awk ' { for (i=6; i<=NF; i++) {
    50. # if ($i ~ /\//) {
    51. # split($i, a, "/");
    52. # if (a[1] == a[2]) $i=a[1];
    53. # }
    54. # }
    55. # };
    56. # {print $0} ' out.vcf.table.all.tmp > out.vcf.table.all
    57. # Switch the env to viral-ngs4 and manully run snpEff
    58. #(viral-ngs4) jhuang@WS-2290C:~/DATA/Data_Nicole_CaptureProbeSequencing/work/ea/6f30cd5eed0efbbf3e3fe1ddfac0df$ snpEff eff -no-downstream -no-intergenic -ud 100 -formatEff -v 1158c840951524dbd03a1a055a837d3828f6f29af1ec2771219e77c out.vcf > out.annotated.vcf ##-debug
    59. # Number of chromosomes : 1
    60. # Chromosomes : Format 'chromo_name size codon_table'
    61. # 'OP297860' 152526 Standard
    62. #vim /home/jhuang/miniconda3/envs/viral-ngs4/share/snpeff-4.1l-8/snpEff.config
    63. #1158c840951524dbd03a1a055a837d3828f6f29af1ec2771219e77c.chromosomes : OP297860
    64. # Alternative snpEff calling with "bin/interhost.py snpEff"
    65. #(viral-ngs4) jhuang@WS-2290C:~/DATA/Data_Nicole_CaptureProbeSequencing/work/ea/6f30cd5eed0efbbf3e3fe1ddfac0df$ ../../../bin/interhost.py snpEff out.filtered.vcf OP297860.1 out.annotated.vcf j.huang@uke.de --loglevel DEBUG
    66. #remove headers from annotated vcf and out.vcf
    67. grep -v '#' out.annotated.vcf > out.annotated.vcf.headerless
    68. #grep -v '#' out.vcf > out.vcf.headerless
    69. awk '{
    70. if (match($0,"EFF=")){print substr($0,RSTART)}
    71. else
    72. print ""
    73. }' out.annotated.vcf.headerless > effects
    74. sed -i 's/EFF=//' effects
    75. sed -i 's/(/ /g' effects
    76. sed -i 's/|/ /g' effects
    77. sed -i 's/UPSTREAM MODIFIER /UPSTREAM MODIFIER - /g' effects
    78. cut -d " " -f -8 effects > effects.mrg
    79. sed -i 's/ /\t/g' effects.mrg
    80. rm effects
    81. tail -n+2 out.vcf.table.all > out.vcf.table.all.headerless
    82. sed -i 's/ /\t/g' out.vcf.table.all.headerless
    83. paste out.vcf.table.all.headerless effects.mrg > out.vcf.headerless.plus.effects
    84. head -n1 out.vcf.table.all | sed 's/.GT//g' > header.left
    85. echo -e "Effect\tImpact\tFunctional_Class\tCodon_change\tProtein_and_nucleotide_change\tAmino_Acid_Length\tGene_name\tBiotype" > header.right
    86. paste header.left header.right > header
    87. cat header out.vcf.headerless.plus.effects > All_SNPs_indels_annotated.txt
    88. echo "SPANDx has finished"
    89. cp All_SNPs_indels_annotated.txt ../../../Outputs/Phylogeny_and_annotation/
  4. merge the two variant calling

    1. #Output: interhost_variants/snippy/summary_snps_indels.csv
    2. python3 ~/Scripts/summarize_snippy_res.py interhost_variants/snippy #Note that although the ALT bases are wrong, but we only need the positions. We can use the results for downstream processing!
    3. #Sort summary_snps_indels.csv according to the coordinate positions.
    4. #merge the following two files summary_snps_indels.csv (70) and All_SNPs_indels_annotated.txt (819) --> merged_variants.csv (69)
    5. python3 ~/Scripts/merge_snps_indels.py interhost_variants/snippy/summary_snps_indels.csv Outputs/Phylogeny_and_annotation/All_SNPs_indels_annotated.txt merged_variants.csv
    6. #check if the number of the output file is correct?
    7. comm -12 <(cut -d, -f2 interhost_variants/snippy/summary_snps_indels.csv | sort | uniq) <(cut -f2 Outputs/Phylogeny_and_annotation/All_SNPs_indels_annotated.txt | sort | uniq) | wc -l
    8. comm -12 <(cut -d, -f2 interhost_variants/snippy/summary_snps_indels.csv | sort | uniq) <(cut -f2 Outputs/Phylogeny_and_annotation/All_SNPs_indels_annotated.txt | sort | uniq)
    9. #The only difference is 58615
    10. #Manually check the final results and delete some strange results and save merged_variants.csv as variants.xlsx
    11. #sort interhost_index -u > interhost_index_sorted
    12. #sort intrahost_index -u > intrahost_index_sorted
    13. #comm interhost_index_sorted intrahost_index_sorted
    14. # !!!! Manually checking intrahost records, if one record in a sample-group > 0.5, it should be a record interhost, look for if the records in the spandx-result. If the record is there, copy it to the interhost variant sheet!
    15. The records in all records of intrahost variants should be always < 0.5, if a record is > 0.5, if should be in interhost variants. Delete all records from intrahost variants when a record > 0.5 and it is not occuring in All_SNPs_indels_annotated.txt !!!! Ausnahme ist the record such as 65225:
    16. #OP297860 65225 G A SNP G G/A intragenic_variant MODIFIER n.65225G>A UL30
    17. #OP297860 65225 HSV1_S1 HSV1_S1 G,A 0 intragenic_variant n.65225G>A UL30 Gene_63070_67475
    18. #OP297860 65225 HSV-Klinik_S2 HSV-Klinik_S2 G,A 0.891530460624071 intragenic_variant n.65225G>A UL30 Gene_63070_67475
    19. ##improve the header
    20. #sed -i '1s/_trimmed_P//g' merged_variants.csv
    21. ##check the REF and K1 have the same base and delete those records with difference.
    22. #cut -f3 -d',' merged_variants.csv > f3
    23. #cut -f6 -d',' merged_variants.csv > f6
    24. #diff f3 f6
    25. #awk -F, '$3 == $6 || NR==1' merged_variants.csv > filtered_merged_variants.csv #(93)
    26. #cut -f3 -d',' filtered_merged_variants.csv > f3
    27. #cut -f6 -d',' filtered_merged_variants.csv > f6
    28. #diff f3 f6
    29. ##MANUALLY REMOVE the column f6 in filtered_merged_variants.csv, and rename CHROM to HDRNA_01_K01 in the header, summarize chr and plasmids SNPs of a sample together to a single list, save as an Excel-file.
  5. (Optional, the step is currently only for intrahost variant calling) Filtering low complexity

    1. fastp -i HSV1_S1_trimmed_P_1.fastq -I HSV1_S1_trimmed_P_2.fastq -o HSV1_S1_trimmed_R1.fastq -O HSV1_S1_trimmed_R2.fastq --low_complexity_filter --complexity_threshold 30
    2. fastp -i HSV-Klinik_S2_trimmed_P_1.fastq -I HSV-Klinik_S2_trimmed_P_2.fastq -o HSV-Klinik_S2_trimmed_R1.fastq -O HSV-Klinik_S2_trimmed_R2.fastq --low_complexity_filter --complexity_threshold 30
    3. Read1 before filtering:
    4. total reads: 1755209
    5. total bases: 163663141
    6. Q20 bases: 162306612(99.1711%)
    7. Q30 bases: 159234526(97.2941%)
    8. Read2 before filtering:
    9. total reads: 1755209
    10. total bases: 163045950
    11. Q20 bases: 161178082(98.8544%)
    12. Q30 bases: 157052184(96.3239%)
    13. Read1 after filtering:
    14. total reads: 1733241
    15. total bases: 161547828
    16. Q20 bases: 160217907(99.1768%)
    17. Q30 bases: 157196236(97.3063%)
    18. Read2 aftering filtering:
    19. total reads: 1733241
    20. total bases: 160825521
    21. Q20 bases: 159057902(98.9009%)
    22. Q30 bases: 155354052(96.5979%)
    23. Filtering result:
    24. reads passed filter: 3466482
    25. reads failed due to low quality: 550
    26. reads failed due to too many N: 0
    27. reads failed due to too short: 0
    28. reads failed due to low complexity: 43386
    29. reads with adapter trimmed: 21424
    30. bases trimmed due to adapters: 159261
    31. Duplication rate: 14.2379%
    32. Insert size peak (evaluated by paired-end reads): 41
    33. JSON report: fastp.json
    34. HTML report: fastp.html
    35. fastp -i HSV1_S1_trimmed_P_1.fastq -I HSV1_S1_trimmed_P_2.fastq -o HSV1_S1_trimmed_R1.fastq -O HSV1_S1_trimmed_R2.fastq --low_complexity_filter --complexity_threshold 30
    36. fastp v0.20.1, time used: 7 seconds
    37. Read1 before filtering:
    38. total reads: 2688264
    39. total bases: 330035144
    40. Q20 bases: 326999269(99.0801%)
    41. Q30 bases: 320136918(97.0009%)
    42. Read2 before filtering:
    43. total reads: 2688264
    44. total bases: 327364405
    45. Q20 bases: 323331005(98.7679%)
    46. Q30 bases: 314500076(96.0703%)
    47. Read1 after filtering:
    48. total reads: 2660598
    49. total bases: 326564634
    50. Q20 bases: 323572956(99.0839%)
    51. Q30 bases: 316783667(97.0049%)
    52. Read2 aftering filtering:
    53. total reads: 2660598
    54. total bases: 324709841
    55. Q20 bases: 320840657(98.8084%)
    56. Q30 bases: 312570288(96.2614%)
    57. Filtering result:
    58. reads passed filter: 5321196
    59. reads failed due to low quality: 1110
    60. reads failed due to too many N: 0
    61. reads failed due to too short: 0
    62. reads failed due to low complexity: 54222
    63. reads with adapter trimmed: 39080
    64. bases trimmed due to adapters: 357915
    65. Duplication rate: 9.91821%
    66. Insert size peak (evaluated by paired-end reads): 96
    67. JSON report: fastp.json
    68. HTML report: fastp.html
    69. fastp -i HSV-Klinik_S2_trimmed_P_1.fastq -I HSV-Klinik_S2_trimmed_P_2.fastq -o HSV-Klinik_S2_trimmed_R1.fastq -O HSV-Klinik_S2_trimmed_R2.fastq --low_complexity_filter --complexity_threshold 30
    70. fastp v0.20.1, time used: 15 seconds
  6. Using vrap to assembly and annotate the contigs, the spades-step was replaced with idba of DAMIAN; DAMIAN's host-removal steps can also as the confirmation steps for viral-ngs.

    1. # Starting data: ln -s interhost_variants/trimmed .
    2. ln -s ~/Tools/vrap/ .
    3. #CHANGE the txid10298 in download_db.py: txid10298[Organism] AND complete genome[Title]
    4. gzip trimmed/*_R1.fastq trimmed/*_R2.fastq
    5. mv trimmed/*.gz ./
    6. #--host /home/jhuang/REFs/genome.fa --nt=/mnt/nvme0n1p1/blast/nt --nr=/mnt/nvme0n1p1/blast/nr
    7. vrap/vrap.py -1 trimmed/HSV1_S1_R1.fastq.gz -2 trimmed/HSV1_S1_R2.fastq.gz -o HSV1_S1_vrap_out_v3 --bt2idx=/home/jhuang/REFs/genome -t 100 -l 200 -g
    8. vrap/vrap.py -1 trimmed/HSV-Klinik_S2_R1.fastq.gz -2 trimmed/HSV-Klinik_S2_R2.fastq.gz -o HSV-Klinik_S2_vrap_out_v3 --bt2idx=/home/jhuang/REFs/genome -t 100 -l 200 -g
    9. #--> If ERROR in spades-assembly, we usding idba from DAMIAN assembly, copy the assembly to spades. Then rerun vrap.py above!
    10. # * 4 nt_dbs (--virus, --host, download_db.py(nucleotide), nt), 2 prot_db (download_db.py(protein), nr) for blast, save under ./blast/db/virus, ./blast/db/host, vrap/database/viral_db/viral_nucleotide, vrap/database/viral_db/viral_protein
    11. # * 1 bowtie_database for host removal (--host), save under ./bowtie/host.
    12. # * bowtie run before assembly
    13. # * blast run after assembly for the contigs, therefore it does not exist the taxfilt step in vrap.
    14. # * checking the order of the databases for annotation step, namely which database will be taken firstly for annotionn after setting --virus?
    15. # * If --host is for both bowtie and blastn, if only --bt2idx define, only bowtie, no blastn! --> commented --host=/home/jhuang/REFs/genome.fa still has the host-removal step!
    16. # * "--virus=vrap/database/viral_db/nucleotide.fa" don't need give, since it is already defined in ./blast/db/virus
    17. # * the process: lighter (fast, memory-efficient tool for correcting sequencing errors) --> flash (tool to find the correct overlap between paired-end reads and extend the reads by stitching them together) --> bowtie (delete the host reads) --> spades --> cap3 (CAP3: A DNA sequence assembly program and it has a capability to clip 5' and 3' low-quality regions of reads) --> calculating orf density --> hmmer --> blast
    18. # Download all virus genomes
    19. mv datasets /usr/local/bin/
    20. chmod +x /usr/local/bin/datasets
    21. #datasets download virus genome --complete-only --assembly-source refseq
    22. datasets download virus genome taxon "Viruses" --complete-only --refseq
    23. #To check for RefSeq data only, look for NC_, NM_, or similar prefixes in sequence headers and identifiers.
    24. wget -r -np -nH --cut-dirs=3 ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/
    25. # The commends for more comprehensive blast annotation
    26. vrap/vrap.py -1 trimmed/HSV1_S1_R1.fastq.gz -2 trimmed/HSV1_S1_R2.fastq.gz -o HSV1_S1_vrap_out_v4 --bt2idx=/home/jhuang/REFs/genome --host=/home/jhuang/REFs/genome.fa --virus=/home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/vrap/database/ncbi_dataset/data/genomic.fna --nt=/mnt/nvme0n1p1/blast/nt --nr=/mnt/nvme0n1p1/blast/nr -t 100 -l 200 -g
    27. vrap/vrap.py -1 trimmed/HSV-Klinik_S2_R1.fastq.gz -2 trimmed/HSV-Klinik_S2_R2.fastq.gz -o HSV-Klinik_S2_vrap_out_v4 --bt2idx=/home/jhuang/REFs/genome --host=/home/jhuang/REFs/genome.fa --virus=/home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/vrap/database/ncbi_dataset/data/genomic.fna --nt=/mnt/nvme0n1p1/blast/nt --nr=/mnt/nvme0n1p1/blast/nr -t 100 -l 200 -g
    28. #END
    29. #using the bowtie of vrap to map the reads on ref_genome/reference.fasta
    30. vrap/vrap.py -1 trimmed/HSV1_S1_R1.fastq.gz -2 trimmed/HSV1_S1_R2.fastq.gz -o HSV1_S1_vrap_out_v5 --host ref_genome/reference.fasta -t 100 -l 200 -g
    31. vrap/vrap.py -1 trimmed/HSV-Klinik_S2_R1.fastq.gz -2 trimmed/HSV-Klinik_S2_R2.fastq.gz -o HSV-Klinik_S2_vrap_out_v5 --host ref_genome/reference.fasta -t 100 -l 200 -g
    32. cd bowtie
    33. mv mapped mapped.sam
    34. samtools view -S -b mapped.sam > mapped.bam
    35. samtools sort mapped.bam -o mapped_sorted.bam
    36. samtools index mapped_sorted.bam
    37. samtools view -H mapped_sorted.bam
    38. samtools flagstat mapped_sorted.bam
    39. #106435 + 0 mapped (3.11% : N/A)
    40. #106435 + 0 primary mapped (3.11% : N/A)
    41. #8204 + 0 properly paired (0.26% : N/A)
    42. #63742 + 0 with itself and mate mapped
    43. 8204+63742
    44. #1144448 + 0 mapped (26.25% : N/A)
    45. #1144448 + 0 primary mapped (26.25% : N/A)
    46. #124068 + 0 properly paired (3.76% : N/A)
    47. #581256 + 0 with itself and mate mapped
    48. 124068+581256
    49. bamCoverage -b mapped_sorted.bam -o ../../HSV1_S1_reads_coverage2.bw
    50. bamCoverage -b mapped_sorted.bam -o ../../HSV-Klinik_S2_reads_coverage2.bw
    51. #Command line spades:
    52. /home/jhuang/miniconda3/envs/vrap/bin/spades.py -1 /mnt/md1/DATA_md1/Data_Nicole_CaptureProbeSequencing/HSV-Klinik_S2_vrap_out_v3/bowtie/bowtie.un.1.fastq -2 /mnt/md1/DATA_md1/Data_Nicole_CaptureProbeSequencing/HSV-Klinik_S2_vrap_out_v3/bowtie/bowtie.un.2.fastq --s1 /mnt/md1/DATA_md1/Data_Nicole_CaptureProbeSequencing/HSV-Klinik_S2_vrap_out_v3/bowtie/bowtie.un.fastq -k 33,55,77,99,127 --cov-cutoff off --only-assembler --careful -t 100 -o /mnt/md1/DATA_md1/Data_Nicole_CaptureProbeSequencing/HSV-Klinik_S2_vrap_out_v3/spades
    53. #Command line cap3:
    54. /home/jhuang/Tools/vrap/external_tools/cap3 /mnt/md1/DATA_md1/Data_Nicole_CaptureProbeSequencing/HSV-Klinik_S2_vrap_out_v3/spades/contigs.fasta -y 100
    55. damian.rb --host human3 --type dna -1 /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/HSV1_S1_trimmed_R1.fastq.gz -2 /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/HSV1_S1_trimmed_R2.fastq.gz --sample HSV1_S1_megablast --blastn never --blastp never --min_contiglength 100 --threads 56 --force
    56. damian.rb --host human3 --type dna -1 /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/HSV-Klinik_S2_trimmed_R1.fastq.gz -2 /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/HSV-Klinik_S2_trimmed_R2.fastq.gz --sample HSV-Klinik_S2_megablast --blastn never --blastp never --min_contiglength 100 --threads 56 --force
    57. [16:42:55 2024-11-12] Removing adapter and host sequences
    58. Trimming readpair 1: /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/trimmed/HSV1_S1_R1.fastq.gz and /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/trimmed/HSV1_S1_R2.fastq.gz
    59. Host reads: 11.71%
    60. Fragment size: 212 (sd:64)
    61. Subtracting host: human3 (Homo_sapiens_UCSC_hg38 (dna))
    62. Alignment rate: 0.52%
    63. Subtracting host: human3 (Homo sapiens (cdna))
    64. Alignment rate: 0.02%
    65. Subtracting host: human3 (Homo sapiens (ncrna))
    66. Alignment rate: 0.01%
    67. [17:20:31 2024-11-12] Removing adapter and host sequences
    68. Trimming readpair 1: /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/trimmed/HSV-Klinik_S2_R1.fastq.gz and /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/trimmed/HSV-Klinik_S2_R2.fastq.gz
    69. Host reads: 44.47%
    70. Fragment size: 236 (sd:77)
    71. Subtracting host: human3 (Homo_sapiens_UCSC_hg38 (dna))
    72. Alignment rate: 29.34%
    73. Subtracting host: human3 (Homo sapiens (cdna))
    74. Alignment rate: 0.66%
    75. Subtracting host: human3 (Homo sapiens (ncrna))
    76. Alignment rate: 0.64%
    77. [17:25:27 2024-11-12] Assembling
    78. [17:38:39 2024-11-12] Parsing assembly
    79. Large contigs (500bp and longer): 259
    80. Large orfs (75bp and longer): 843
    81. [17:38:58 2024-11-12] Seeking protein domains
    82. Contigs with domains: 162
    83. [17:40:36 2024-11-12] Annotating contigs
    84. cp ~/rtpd_files/HSV1_S1_megablast/idba_ud_assembly/contig.fa contigs.fasta
    85. cp ~/rtpd_files/HSV-Klinik_S2_megablast/idba_ud_assembly/contig.fa contigs.fasta
    86. #RERUN vrap/vrap.py again with the replaced contigs.fasta!
    87. #vrap/vrap.py -1 Affe30_trimmed_R1.fastq.gz -2 Affe30_trimmed_R2.fastq.gz -o Affe30_trimmed_vrap_out -t 40 -l 100
    88. #vrap/vrap.py -1 Affe31_trimmed_R1.fastq.gz -2 Affe31_trimmed_R2.fastq.gz -o Affe31_trimmed_vrap_out -t 40 -l 100
    89. # -- DEBUG_1 --
    90. #DO NOT use '-l 100' in command line
    91. #name 'generic_dna' is not defined
    92. mamba install biopython=1.77 python=3.9 #for supporting "generic_dna"
    93. # SET all records from vrap/database/viral_db/nucleotide.fa as lastal.acids, choose the most occurred in vrap_out as refsel.acids and the record for accessions_for_ref_genome_build in config.yaml.
    94. # Query coverage
    95. Query sequence name Query length ORF density Percentage identity Subject sequence length Subject accession Subject name E-value
    96. grep "Human alphaherpesvirus 1" HSV-Klinik_S2_contigs_summary.csv > HSV-Klinik_S2_contigs_summary_.csv
    97. #--> ON960057.1
    98. name: The name or identifier of the query sequence. This is typically the header from the input sequence file, such as a FASTA file.
    99. qleng: Query length, or the total length of the input sequence (in nucleotides or amino acids, depending on the input type).
    100. orf_d: ORF (Open Reading Frame) direction. This indicates the strand or frame in which the ORF was found, often shown as + for the forward direction or - for the reverse direction.
    101. hmmer_eval: The E-value from the HMMER (Hidden Markov Model) search. This represents the statistical significance of the match between the identified ORF and the reference HMM model. Lower values indicate more significant matches.
    102. hmm_model: The name of the HMM (Hidden Markov Model) profile matched by the ORF. This typically corresponds to a specific viral or protein family model from an HMM database, such as Pfam or custom models used by VRAP.
    103. ident: Percentage identity between the query sequence and the target model or database entry. This measures the similarity of the ORF to the matched model.
    104. qcov: Query coverage, or the percentage of the query sequence that aligns to the target model. This indicates how much of the ORF sequence aligns with the HMM profile.
    105. tcov: Target coverage, or the percentage of the target HMM profile that aligns with the query. This helps assess how well the ORF represents the entire HMM model.
    106. tlength: Target length, or the length of the HMM model sequence in the database. This value can be used to understand how much of the target model was covered by the ORF alignment.
    107. tid: Target identifier, often an accession or ID number for the matched HMM model. This is used to uniquely identify the model within the HMM database.
    108. tname: Target name or description, which provides more information about the HMM model or protein family that the ORF matches.
    109. mean_eval: Mean E-value for the HMMER match, averaged over multiple potential alignments (if any). Lower values imply higher significance, with the mean providing an aggregate metric if there were multiple HMM matches.
    110. #reads_and_contigs_on_JX878414.png
    111. #using the assembly for the calling!
    112. #TODO_TOMORROW: In the final results only mark the SNPs in the contigs > 500 nt (shown as in the figure), otherwise we have too much results! then merge snps (now there is an ERROR during merging!)
  7. Analyses using viral-ngs

    1. conda activate viral3
    2. #conda install -c anaconda openjdk=8
    3. ln -s ~/Tools/viral-ngs/Snakefile Snakefile
    4. ln -s ~/Tools/viral-ngs/bin bin
    5. cp ~/Tools/viral-ngs/refsel.acids refsel.acids
    6. cp ~/Tools/viral-ngs/lastal.acids lastal.acids
    7. cp ~/Tools/viral-ngs/config.yaml config.yaml
    8. cp ~/Tools/viral-ngs/samples-runs.txt samples-runs.txt
    9. cp ~/Tools/viral-ngs/samples-depletion.txt samples-depletion.txt
    10. cp ~/Tools/viral-ngs/samples-metagenomics.txt samples-metagenomics.txt
    11. cp ~/Tools/viral-ngs/samples-assembly.txt samples-assembly.txt
    12. cp ~/Tools/viral-ngs/samples-assembly-failures.txt samples-assembly-failures.txt
    13. mkdir data
    14. cd data
    15. mkdir 00_raw
    16. cd ../..
  8. Prepare lastal.acids, refsel.acids and accessions_for_ref_genome_build in config.yaml

    1. #Herpes simplex virus 1 (HSV-1) and Human alphaherpesvirus 1 (also known as Simplexvirus humanalpha1) are indeed the same virus.
    2. #The different names result from varied naming conventions:
    3. # * Herpes simplex virus 1 (HSV-1) is the common name, often used in clinical and general contexts.
    4. # * Human alphaherpesvirus 1 is the official taxonomic name, as defined by the International Committee on Taxonomy of Viruses (ICTV). This name is used in scientific classifications and databases like NCBI to specify its place in the Herpesviridae family under the Alphaherpesvirinae subfamily.
    5. #In some databases or references, it might also appear under Simplexvirus humanalpha1, which refers to its taxonomic classification at the genus level (Simplexvirus) and species level (Human alphaherpesvirus 1). However, all these terms refer to the same virus, commonly known as HSV-1.
    6. #https://www.uniprot.org/taxonomy?query=Human+herpesvirus
    7. #https://www.uniprot.org/taxonomy/3050292
    8. esearch -db nuccore -query "txid3050292[Organism]" | efetch -format fasta > taxon_3050292_sequences.fasta
    9. esearch -db nuccore -query "txid3050292[Organism]" | efetch -format acc > taxon_3050292_accessions.txt
    10. esearch -db nuccore -query "txid10298[Organism] AND complete genome[Title]" | efetch -format fasta > taxon_3050292_complete_genomes.fasta
    11. esearch -db nuccore -query "txid10298[Organism] AND complete genome[Title]" | efetch -format acc > taxon_10298_complete_genomes.acc # 161 genomes
    12. mv taxon_10298_complete_genomes.acc lastal.acids
    13. https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10298
    14. Human alphaherpesvirus 1 (Herpes simplex virus type 1) Click on organism name to get more information.
    15. Human alphaherpesvirus 1 strain 17
    16. Human alphaherpesvirus 1 strain A44
    17. Human alphaherpesvirus 1 strain Angelotti
    18. Human alphaherpesvirus 1 strain CL101
    19. Human alphaherpesvirus 1 strain CVG-2
    20. Human alphaherpesvirus 1 strain F
    21. Human alphaherpesvirus 1 strain H129
    22. Human alphaherpesvirus 1 strain HFEM
    23. Human alphaherpesvirus 1 strain HZT
    24. Human alphaherpesvirus 1 strain KOS
    25. Human alphaherpesvirus 1 strain MGH-10
    26. Human alphaherpesvirus 1 strain MP
    27. Human alphaherpesvirus 1 strain Patton
    28. Human alphaherpesvirus 1 strain R-15
    29. Human alphaherpesvirus 1 strain R19
    30. Human alphaherpesvirus 1 strain RH2
    31. Human alphaherpesvirus 1 strain SC16
  9. Trimming using trimmomatic

    1. # Starting data: ln -s interhost_variants/raw_data .
    2. mkdir bams
    3. for sample in HSV1_S1 HSV-Klinik_S2 NTC_S3; do
    4. for sample in HSV1_S1; do
    5. java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 16 ./raw_data/${sample}_R1.fastq.gz ./raw_data/${sample}_R2.fastq.gz trimmed/${sample}_R1.fastq.gz trimmed/${sample}_unpaired_R1.fastq.gz trimmed/${sample}_R2.fastq.gz trimmed/${sample}_unpaired_R2.fastq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20; \
    6. done
  10. Mapping

    1. cd trimmed
    2. seqtk sample -s100 HSV1_S1_R1.fastq.gz 0.1 > HSV1_S1_sampled_R1.fastq
    3. seqtk sample -s100 HSV1_S1_R2.fastq.gz 0.1 > HSV1_S1_sampled_R2.fastq
    4. gzip HSV1_S1_sampled_R1.fastq HSV1_S1_sampled_R2.fastq
    5. ref_fa="NC_001806.fasta";
    6. for sample in HSV1_S1 HSV-Klinik_S2 NTC_S3; do
    7. for sample in HSV1_S1; do
    8. for sample in HSV1_S1_sampled; do
    9. bwa index ${ref_fa}; \
    10. bwa mem -M -t 16 ${ref_fa} trimmed/${sample}_R1.fastq.gz trimmed/${sample}_R2.fastq.gz | samtools view -bS - > bams/${sample}_genome_alignment.bam; \
    11. #for table filling using the following commands! -->3000000 \
    12. #bwa mem -M -t 14 ${ref_fa} ${sample}_R1.fastq.gz ${sample}_R2.fastq.gz | samtools view -bS -F 256 - > bams/${sample}_uniqmap.bam; \
    13. done
  11. AddOrReplaceReadGroup is IMPORTANT step, otherwise the step viral_ngs cannot run correctly

    1. for sample in HSV1_S1 HSV-Klinik_S2 NTC_S3; do
    2. for sample in HSV1_S1; do
    3. for sample in HSV1_S1_sampled; do
    4. picard AddOrReplaceReadGroups I=bams/${sample}_genome_alignment.bam O=data/00_raw/${sample}.bam SORT_ORDER=coordinate CREATE_INDEX=true RGPL=illumina RGID=$sample RGSM=$sample RGLB=standard RGPU=$sample VALIDATION_STRINGENCY=LENIENT; \
    5. done
  12. Configure the viral-ngs conda environment

    1. conda config --add channels r
    2. conda config --add channels defaults
    3. conda config --add channels conda-forge
    4. conda config --add channels bioconda
    5. conda config --add channels broad-viral
    6. # -- Works not correctly --
    7. #conda list --export > environment2.yml
    8. #mamba create --name viral-ngs4 --file environment2.yml
    9. mamba env remove -n viral-ngs4
    10. mamba create -n viral-ngs4 python=3.6 blast=2.6.0 bmtagger biopython pysam pyyaml picard mvicuna pybedtools fastqc matplotlib spades last=876 -c conda-forge -c bioconda
    11. conda activate viral-ngs4
    12. mamba create -n viral-ngs4 python=3.6
    13. conda activate viral-ngs4
    14. #vim requirements-conda.txt
    15. mamba install blast=2.6.0 bmtagger biopython pysam pyyaml picard mvicuna pybedtools fastqc matplotlib spades last=876 -c conda-forge -c bioconda
    16. # -- Eventually DEBUG --
    17. #mamba remove picard
    18. #mamba clean --packages
    19. #mamba install -c bioconda picard
    20. ##mamba install libgfortran=5 sqlite=3.46.0
    21. ##mamba install picard --clobber
    22. ##mamba create -n viral-ngs-fresh -c bioconda -c conda-forge picard python=3.6 sqlite=3.46.0 libgfortran=5
    23. mamba install cd-hit cd-hit-auxtools diamond gap2seq=2.1 mafft=7.221 mummer4 muscle=3.8 parallel pigz prinseq samtools=1.6 tbl2asn trimmomatic trinity unzip vphaser2 bedtools -c r -c defaults -c conda-forge -c bioconda #-c broad-viral
    24. mamba install snpeff=4.1l
    25. mamba install gatk=3.6
    26. mamba install bwa
    27. #IMPORTANT_REPLACE "sudo cp /home/jhuang/miniconda3/envs/viral-ngs4/bin/gatk3 /usr/local/bin/gatk"
    28. #IMPORTANT_UPDATE jar_file in the file with "/home/jhuang/Tools/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar"
    29. #IMPORTANT_SET /home/jhuang/Tools/GenomeAnalysisTK-3.6 as GATK_PATH in config.yaml
    30. #IMPORTANT_CHECK if it works
    31. # java -jar /home/jhuang/Tools/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar -T RealignerTargetCreator --help
    32. # /usr/local/bin/gatk -T RealignerTargetCreator --help
    33. #IMPORTANT_NOTE that the env viral-ngs4 cannot logined from the base env due to the python3-conflict!
    34. mamba install vphaser2=2.0
    35. # -- NO ERROR --> INSTALL END HERE --
    36. # -- DEBUG: ClobberError: This transaction has incompatible packages due to a shared path. --
    37. # SafetyError: The package for snpeff located at /home/jhuang/miniconda3/pkgs/snpeff-4.1l-hdfd78af_8
    38. # appears to be corrupted. The path 'share/snpeff-4.1l-8/snpEff.config'
    39. # has an incorrect size.
    40. # reported size: 9460047 bytes
    41. # actual size: 9460357 bytes
    42. #
    43. # ClobberError: This transaction has incompatible packages due to a shared path.
    44. # packages: bioconda/linux-64::bowtie2-2.5.4-h7071971_4, bioconda/linux-64::bowtie-1.3.1-py36h769816f_3
    45. # path: 'bin/scripts/convert_quals.pl'
    46. # sovle confilict between bowtie, bowtie2 and snpeff
    47. mamba remove bowtie
    48. mamba install bowtie2
    49. mamba remove snpeff
    50. mamba install snpeff=4.1l
    51. # -- WITH ERROR caused by bowtie and snpeff --> INSTALL END HERE --
    52. #mamba install -c bioconda viral-ngs #so that gatk3-register and novoalign-license-register available --> ERROR
    53. #Due to license restrictions, the viral-ngs conda package cannot distribute and install GATK directly. To fully install GATK, you must download a licensed copy of GATK v3.8 from the Broad Institute, and call “gatk3-register,” which will copy GATK into your viral-ngs conda environment:
    54. mkdir -p /path/to/gatk_dir
    55. wget -O - 'https://software.broadinstitute.org/gatk/download/auth?package=GATK-archive&version=3.6-0-g89b7209' | tar -xjvC /path/to/gatk_dir
    56. gatk3-register /path/to/gatk_dir/GenomeAnalysisTK.jar
    57. #The single-threaded version of Novoalign is installed by default. If you have a license for Novoalign to enable multi-threaded operation, viral-ngs will copy it to the viral-ngs conda environment if the NOVOALIGN_LICENSE_PATH environment variable is set. Alternatively, the conda version of Novoalign can be overridden if the NOVOALIGN_PATH environment variable is set. If you obtain a Novoalign license after viral-ngs has already been installed, it can be added to the conda environment by calling:
    58. # obtain a Novoalign license file: novoalign.lic
    59. novoalign-license-register /path/to/novoalign.lic
    60. # # --We don't have registers, so we have to manually install novoalign and gatk--
    61. # #At first install novoalign, then samtools
    62. # mamba remove samtools
    63. # mamba install -c bioconda novoalign # Eventually not necessary, since the path is defined in config.yaml NOVOALIGN_PATH: "/home/jhuang/Tools/novocraft_v3", and novoalign.lic is also in the same path.
    64. # mamba install -c bioconda samtools
    65. #
    66. # mamba install -c bioconda gatk #(3.8) #IN /usr/local/bin/gatk FROM /home/jhuang/Tools/SPANDx_v3.2/GenomeAnalysisTK.jar
    67. # #UPDATED TO: '/home/jhuang/Tools/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar'
    68. # # If necessary, clean up the conda cache. This will remove any partially installed or corrupted packages.
    69. # conda clean --all
    70. ## reinstall samtools 1.6 --> NOT RELEVANT
    71. #mamba install samtools=1.6
  13. Run snakemake

    1. #Set values in samples-*.txt before running viral-ngs
    2. rm -rf ref_genome refsel_db lastal_db
    3. mv data data_v1;
    4. mv tmp tmp_v1;
    5. mkdir data tmp
    6. mv data_v1/00_raw data
    7. snakemake --printshellcmds --cores 10
    8. #Manully remove the records in the intrahost-results when it occurs in the interhost-tables as save the final intrahost-results as a Excel-Sheet in the variants.xlsx.
    9. /usr/local/bin/gatk
    10. https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php
    11. java -jar ~/Tools/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T RealignerTargetCreator --help #--> CORRECT
    12. java -jar ~/Tools/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar -T RealignerTargetCreator --help #--> CORRECT
    13. /usr/local/bin/gatk -T RealignerTargetCreator --help
    14. https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php
    15. Djava.io.tmpdir=/tmp/tmp-assembly-refine_assembly-2d9z3pcr
    16. java -jar ~/Tools/GenomeAnalysisTK-2.8-1/GenomeAnalysisTK.jar -T RealignerTargetCreator -I /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmp0_vh27ji.rmdup.bam -R /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmppwwyriob.deambig.fasta -o /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmp_o2f2e0o.intervals --num_threads 120
    17. java -jar ~/Tools/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T RealignerTargetCreator -I /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmp0_vh27ji.rmdup.bam -R /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmppwwyriob.deambig.fasta -o /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmp_o2f2e0o.intervals --num_threads 120
    18. ~/Tools/GenomeAnalysisTK-4.1.2.0/gatk -T RealignerTargetCreator -I /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmp0_vh27ji.rmdup.bam -R /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmppwwyriob.deambig.fasta -o /tmp/tmp-assembly-refine_assembly-2d9z3pcr/tmp_o2f2e0o.intervals --num_threads 120
    19. # -- DEBUG_1: Configure the Conda Environment to Use Host's Java (version 17) while keeping BLAST 2.6.0+ --
    20. bin/taxon_filter.py deplete data/00_raw/HSV1_S1.bam tmp/01_cleaned/HSV1_S1.raw.bam tmp/01_cleaned/HSV1_S1.bmtagger_depleted.bam tmp/01_cleaned/HSV1_S1.rmdup.bam data/01_cleaned/HSV1_S1.cleaned.bam --bmtaggerDbs /home/jhuang/REFs/viral_ngs_dbs/bmtagger_dbs_remove/metagenomics_contaminants_v3 /home/jhuang/REFs/viral_ngs_dbs/bmtagger_dbs_remove/GRCh37.68_ncRNA-GRCh37.68_transcripts-HS_rRNA_mitRNA /home/jhuang/REFs/viral_ngs_dbs/bmtagger_dbs_remove/hg19 --blastDbs /home/jhuang/REFs/viral_ngs_dbs/blast_dbs_remove/metag_v3.ncRNA.mRNA.mitRNA.consensus /home/jhuang/REFs/viral_ngs_dbs/blast_dbs_remove/hybsel_probe_adapters --threads 120 --srprismMemory 142500000 --JVMmemory 256g --loglevel DEBUG
    21. #2024-11-06 15:55:01,162 - __init__:444:_attempt_install - DEBUG - Currently installed version of blast: 2.16.0-hc155240_2
    22. #2024-11-06 15:55:01,162 - __init__:448:_attempt_install - DEBUG - Expected version of blast: 2.6.0
    23. #2024-11-06 15:55:01,162 - __init__:449:_attempt_install - DEBUG - Incorrect version of blast installed. Removing it...
    24. # + (blast 2.6.0 needs java 17, therefore java="/usr/lib/jvm/java-17-openjdk-amd64/bin/java" in /home/jhuang/miniconda3/envs/viral-ngs2/bin/picard) blast 2.6.0 boost1.64_2 bioconda Cached
    25. # + (bmtagger 3.101 needs blast 2.6.0) blast=2.6.0 + bmtagger 3.101 h470a237_4 bioconda Cached
    26. # + pango 1.50.7 hbd2fdc8_0 conda-forge Cached
    27. # + openjdk 11.0.15 hc6918da_0 conda-forge Cached
    28. # + r-base 4.2.0 h1ae530e_0 pkgs/r Cached
    29. # + picard 3.0.0 hdfd78af_0 bioconda Cached
    30. # + java -version openjdk version "11.0.15-internal" 2022-04-19
    31. Then, edit in the following file so that it can use the host java (version 17) for the viral-ngs2 picard 3.0.0! --
    32. vim /home/jhuang/miniconda3/envs/viral-ngs2/bin/picard
    33. # ---------------------------------------------------------
    34. # Use Java installed with Anaconda to ensure correct version
    35. java="$ENV_PREFIX/bin/java"
    36. # if JAVA_HOME is set (non-empty), use it. Otherwise keep "java"
    37. if [ -n "${JAVA_HOME:=}" ]; then
    38. if [ -e "$JAVA_HOME/bin/java" ]; then
    39. java="$JAVA_HOME/bin/java"
    40. fi
    41. fi
    42. # -------------------------------------------------------->
    43. #COMMENTED
    44. # Use Java installed with Anaconda to ensure correct version
    45. #java="$ENV_PREFIX/bin/java"
    46. #MODIFIED
    47. ## if JAVA_HOME is set (non-empty), use it. Otherwise keep "java"
    48. #if [ -n "${JAVA_HOME:=}" ]; then
    49. # if [ -e "$JAVA_HOME/bin/java" ]; then
    50. # java="$JAVA_HOME/bin/java"
    51. # fi
    52. #fi
    53. java="/usr/lib/jvm/java-17-openjdk-amd64/bin/java"
    54. # ---------------------------------------------------------
    55. # -- DEBUG_2: lastal version not compatible --
    56. bin/ncbi.py fetch_fastas j.huang@uke.de lastal_db NC_001806.2 --combinedFilePrefix lastal --removeSeparateFiles --forceOverwrite --chunkSize 300
    57. bin/taxon_filter.py filter_lastal_bam data/01_cleaned/HSV1_S1.cleaned.bam lastal_db/lastal.fasta data/01_cleaned/HSV1_S1.taxfilt.bam --threads 120 --JVMmemory 256g --loglevel DEBUG
    58. mamba remove last
    59. mamba install -c bioconda last=876
    60. lastal -V
    61. bin/taxon_filter.py filter_lastal_bam data/01_cleaned/HSV1_S1.cleaned.bam lastal_db/lastal.fasta data/01_cleaned/HSV1_S1.taxfilt.bam --threads 120 --JVMmemory 256g --loglevel DEBUG
    62. # -- DEBUG_3: lastal version not compatible --
    63. bin/assembly.py gapfill_gap2seq tmp/02_assembly/HSV1_S1_sampled.assembly2-scaffolded.fasta data/01_per_sample/HSV1_S1_sampled.cleaned.bam tmp/02_assembly/HSV1_S1_sampled.assembly2-gapfilled.fasta --memLimitGb 12 --maskErrors --randomSeed 0 --loglevel DEBUG
    64. #2024-11-07 12:34:14,732 - __init__:460:_attempt_install - DEBUG - Attempting install...
    65. #2024-11-07 12:34:14,733 - __init__:545:install_package - DEBUG - Creating conda environment and installing package gap2seq=2.1
    66. mamba install gap2seq=2.1
    67. # -- DEBUG_4 --
    68. bin/assembly.py impute_from_reference tmp/02_assembly/HSV1_S1_sampled.assembly2-gapfilled.fasta tmp/02_assembly/HSV1_S1_sampled.assembly2-scaffold_ref.fasta tmp/02_assembly/HSV1_S1_sampled.assembly3-modify.fasta --newName HSV1_S1_sampled --replaceLength 55 --minLengthFraction 0.05 --minUnambig 0.05 --index --loglevel DEBUG
    69. 2024-11-07 14:05:20,438 - __init__:445:_attempt_install - DEBUG - Currently installed version of muscle: 5.2-h4ac6f70_0
    70. 2024-11-07 14:05:20,438 - __init__:448:_attempt_install - DEBUG - Expected version of muscle: 3.8.1551
    71. 2024-11-07 14:05:20,438 - __init__:449:_attempt_install - DEBUG - Incorrect version of muscle installed. Removing it...
    72. mamba install muscle=3.8
    73. #- muscle 5.2 h4ac6f70_0 bioconda Cached
    74. #+ muscle 3.8.1551 h7d875b9_6 bioconda Cached
    75. #/home/jhuang/Tools/novocraft_v3/novoalign -f data/01_per_sample/HSV1_S1.cleaned.bam -r Random -l 20 -g 40 -x 20 -t 100 -F BAM -d tmp/02_assembly/HSV1_S1.assembly4-refined.nix -o SAM
    76. # -- DEBUG_5 --
    77. bin/assembly.py refine_assembly tmp/02_assembly/HSV1_S1_sampled.assembly3-modify.fasta data/01_per_sample/HSV1_S1_sampled.cleaned.bam tmp/02_assembly/HSV1_S1_sampled.assembly4-refined.fasta --outVcf tmp/02_assembly/HSV1_S1_sampled.assembly3.vcf.gz --min_coverage 2 --novo_params '-r Random -l 20 -g 40 -x 20 -t 502' --threads 120 --loglevel DEBUG
    78. #Shebang in /usr/local/bin/gatk is corrupt.
    79. # -- DEBUG_6 --
    80. bin/interhost.py multichr_mafft ref_genome/reference.fasta data/02_assembly/HSV1_S1_sampled.fasta data/03_multialign_to_ref --ep 0.123 --maxiters 1000 --preservecase --localpair --outFilePrefix aligned --sampleNameListFile data/03_multialign_to_ref/sampleNameList.txt --threads 120 --loglevel DEBUG
    81. 2024-11-07 14:47:34,163 - __init__:445:_attempt_install - DEBUG - Currently installed version of mafft: 7.526-h4bc722e_0
    82. 2024-11-07 14:47:34,163 - __init__:448:_attempt_install - DEBUG - Expected version of mafft: 7.221
    83. 2024-11-07 14:47:34,164 - __init__:449:_attempt_install - DEBUG - Incorrect version of mafft installed. Removing it...
    84. mamba install mafft=7.221
    85. # -- DEBUG_7 --
    86. bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV1_S1_sampled.mapped.bam data/02_assembly/HSV1_S1_sampled.fasta data/04_intrahost/vphaser2.HSV1_S1_sampled.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10 --loglevel DEBUG
    87. export TMPDIR=/home/jhuang/tmp
    88. (viral-ngs) jhuang@WS-2290C:~/DATA/Data_Nicole_CaptureProbeSequencing$ /home/jhuang/miniconda3/envs/viral-ngs/bin/vphaser2 -i /home/jhuang/tmp/tmp_bq17yoq.mapped-withdoublymappedremoved.bam -o /home/jhuang/tmp/tmpyg8mlj5qvphaser2
    89. samtools depth /home/jhuang/tmp/tmp_bq17yoq.mapped-withdoublymappedremoved.bam > coverage.txt
    90. # -- DEBUG_8 --
    91. snakemake --printshellcmds --cores 100
    92. bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV1_S1 --isnvs data/04_intrahost/vphaser2.HSV1_S1.txt.gz --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession
    93. snakemake --printshellcmds --cores 100
    94. # -- DEBUG_9 --
    95. bin/assembly.py refine_assembly tmp/02_assembly/HSV-Klinik_S2.assembly3-modify.fasta data/01_per_sample/HSV-Klinik_S2.cleaned.bam tmp/02_assembly/HSV-Klinik_S2.assembly4-refined.fasta --outVcf tmp/02_assembly/HSV-Klinik_S2.assembly3.vcf.gz --min_coverage 2 --novo_params '-r Random -l 20 -g 40 -x 20 -t 502' --threads 120 --loglevel DEBUG
    96. /usr/local/bin/gatk -Xmx20g -Djava.io.tmpdir=/home/jhuang/tmp/tmp-assembly-refine_assembly-dx3dr73p -T RealignerTargetCreator -I /home/jhuang/tmp/tmp-assembly-refine_assembly-dx3dr73p/tmpwbzvjo9j.rmdup.bam -R /home/jhuang/tmp/tmp-assembly-refine_assembly-dx3dr73p/tmpxq4obe29.deambig.fasta -o /home/jhuang/tmp/tmp-assembly-refine_assembly-dx3dr73p/tmptkw8zcf3.intervals --num_threads 120
    97. mamba install gatk=3.6
    98. #IMPORTANT_REPLACE "sudo cp /home/jhuang/miniconda3/envs/viral-ngs4/bin/gatk3 /usr/local/bin/gatk"
    99. #IMPORTANT_UPDATE jar_file in the file with "/home/jhuang/Tools/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar"
    100. #IMPORTANT_SET /home/jhuang/Tools/GenomeAnalysisTK-3.6 as GATK_PATH in config.yaml
    101. #IMPORTANT_CHECK if it works
    102. # java -jar /home/jhuang/Tools/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar -T RealignerTargetCreator --help
    103. # /usr/local/bin/gatk -T RealignerTargetCreator --help
    104. #IMPORTANT_NOTE that the env viral-ngs4 cannot logined from the base env due to the python3-conflict!
    105. # -- DEBUG_10 (if the sequencing is too shawlow, then seperate running) --
    106. /home/jhuang/miniconda3/envs/viral-ngs4/bin/vphaser2 -i /tmp/tmp2jl4plhy.mapped-withdoublymappedremoved.bam -o /tmp/tmp1x6jsiu_vphaser2
    107. [EXIT]: gather_alignments: Failed to set region for reference HSV-Klinik_S2-1 in file /tmp/tmp2jl4plhy.mapped-withdoublymappedremoved.bam
    108. # Run seperate intrahost.py --> no error:
    109. #342 reads
    110. 2024-11-08 14:27:33,575 - intrahost:223:compute_library_bias - DEBUG - LB:standard has 161068 reads in 1 read group(s) (HSV-Klinik_S2)
    111. 2024-11-08 14:27:34,875 - __init__:445:_attempt_install - DEBUG - Currently installed version of vphaser2: 2.0-h7a259b3_14
    112. samtools index HSV1_S1.mapped.bam
    113. samtools index HSV-Klinik_S2.mapped.bam
    114. bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV1_S1.mapped.bam data/02_assembly/HSV1_S1.fasta data/04_intrahost/vphaser2.HSV1_S1.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10 --loglevel DEBUG
    115. bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV-Klinik_S2.mapped.bam data/02_assembly/HSV-Klinik_S2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --minReadsEach 1 --maxBias 2 --loglevel DEBUG # --vphaserNumThreads 120 --removeDoublyMappedReads
    116. /home/jhuang/miniconda3/envs/viral-ngs4/bin/vphaser2 -i data/02_align_to_self/HSV-Klinik_S2.mapped.bam -o /tmp/tmpgacpc6eqvphaser2
    117. samtools idxstats data/02_align_to_self/HSV-Klinik_S2.mapped.bam
    118. samtools index data/02_align_to_self/HSV-Klinik_S2.mapped.bam
    119. samtools view -H data/02_align_to_self/HSV-Klinik_S2.mapped.bam
    120. /home/jhuang/miniconda3/envs/viral-ngs4/bin/vphaser2 -i data/02_align_to_self/HSV-Klinik_S2.mapped.bam -o /tmp/output_dir
    121. /home/jhuang/miniconda3/envs/viral-ngs4/bin/vphaser2 -i data/02_align_to_self/HSV-Klinik_S2.mapped.bam -o /tmp/tmpgacpc6eqvphaser2
    122. samtools view -b data/02_align_to_self/HSV-Klinik_S2.mapped.bam "HSV-Klinik_S2-1" > subset.bam
    123. samtools index subset.bam
    124. @SQ SN:HSV-Klinik_S2-1 LN:141125 AS:tmp35_s3ghx.ref_copy.nix
    125. samtools view -b subset.bam "HSV-Klinik_S2-1:1-10000" > small_subset.bam
    126. samtools index small_subset.bam
    127. /home/jhuang/miniconda3/envs/viral-ngs4/bin/vphaser2 -i small_subset.bam -o /tmp/output_dir
    128. /home/jhuang/miniconda3/envs/viral-ngs4/bin/vphaser2 -i subset.bam -o vphaser2_out
    129. # -- DEBUG_11 in step multi_align_mafft: aligned_1.fasta is always empty, we need generate it manually with mafft and mark it as complete --
    130. #[Fri Nov 8 14:51:45 2024]
    131. #rule multi_align_mafft:
    132. # input: data/02_assembly/HSV1_S1.fasta, data/02_assembly/HSV-Klinik_S2.fasta, ref_genome/reference.fasta
    133. # output: data/03_multialign_to_ref/sampleNameList.txt, data/03_multialign_to_ref/aligned_1.fasta, data/03_multialign_to_ref/aligned_2.fasta, ... data/03_multialign_to_ref/aligned_161.fasta
    134. # jobid: 24
    135. # resources: tmpdir=/tmp, mem=8, threads=120
    136. bin/interhost.py multichr_mafft ref_genome/reference.fasta data/02_assembly/HSV1_S1.fasta data/02_assembly/HSV-Klinik_S2.fasta data/03_multialign_to_ref --ep 0.123 --maxiters 1000 --preservecase --localpair --outFilePrefix aligned --sampleNameListFile data/03_multialign_to_ref/sampleNameList.txt --threads 120 --loglevel DEBUG
    137. #b'/home/jhuang/miniconda3/envs/viral-ngs4/bin/python\n'
    138. #-------
    139. #2024-11-08 14:51:46,324 - cmd:193:main_argparse - INFO - software version: 1522433800, python version: 3.6.7 | packaged by conda-forge | (default, #Feb 28 2019, 09:07:38)
    140. #[GCC 7.3.0]
    141. #2024-11-08 15:00:26,375 - cmd:195:main_argparse - INFO - command: bin/interhost.py multichr_mafft inFastas=['ref_genome/reference.fasta', 'data/02_assembly/HSV1_S1.fasta', 'data/02_assembly/HSV-Klinik_S2.fasta'] localpair=True globalpair=None preservecase=True reorder=None gapOpeningPenalty=1.53 ep=0.123 verbose=False outputAsClustal=None maxiters=1000 outDirectory=data/03_multialign_to_ref outFilePrefix=aligned sampleRelationFile=None sampleNameListFile=data/03_multialign_to_ref/sampleNameList.txt threads=120 loglevel=DEBUG tmp_dir=/tmp tmp_dirKeep=False
    142. #2024-11-08 15:00:26,375 - cmd:209:main_argparse - DEBUG - using tempDir: /tmp/tmp-interhost-multichr_mafft-sw91_svl
    143. #2024-11-08 15:00:27,718 - __init__:445:_attempt_install - DEBUG - Currently installed version of mafft: 7.221-0
    144. #2024-11-08 15:00:27,719 - mafft:141:execute - DEBUG - /home/jhuang/miniconda3/envs/viral-ngs4/bin/mafft --thread 120 --localpair --preservecase --op 1.53 --ep 0.123 --quiet --maxiterate 1000 /tmp/tmp-interhost-multichr_mafft-sw91_svl/tmp68_ln_ha.fasta
    145. snakemake --cleanup-metadata 03_multialign_to_ref --cores 4
    146. # -- DEBUG_12 --
    147. #[EXIT]: gather_alignments: Failed to set region for reference HSV-Klinik_S2-1 in file data/02_align_to_self/HSV-Klinik_S2.mapped.bam
    148. #DEBUG_PROCESS1: rm temp/*.region
    149. /home/jhuang/miniconda3/envs/viral-ngs4/bin/vphaser2 -w 5000 -i data/02_align_to_self/HSV-Klinik_S2.mapped.bam -o /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/temp
    150. # 5209 snp
    151. # 21 lv
    152. #SOLUTION: MODFIED AS 'cmd = [self.install_and_get_path(), '-w 5000', '-i', inBam, '-o', outDir]' in bin/tools/vphaser2.py
    153. #ADDED
    154. cmd.append('-w')
    155. cmd.append('25000')
    156. bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV-Klinik_S2.mapped.bam data/02_assembly/HSV-Klinik_S2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10 --loglevel DEBUG
    157. #BEFORE_CHANGE:
    158. b'\n--------------------------------------------------------\nProgram runs with the following Parameter setting:\n\n\tinput BAM file\t=\t/tmp/tmpt6fgovqk.mapped-withdoublymappedremoved.bam\n\toutput Directory\t=\t/tmp/tmp53_oxecyvphaser2\n\terrModel\t\t=\tpileup + phase\n\talpha\t\t=\t0.05\n\tignoreBases \t=\t0\n\t(var_matepair, var_cycle, var_dt, var_qt)\t=\t1,1,1,20\n\tpSample\t\t=\t30%\n\twindowSz\t=\t500\n\tdelta\t=\t2\n\n
    159. #AFTER_CHANGE:
    160. windowSz=5000
    161. #mkdir 02_align_to_ref
    162. bin/read_utils.py align_and_fix data/01_per_sample/HSV1_S1.cleaned.bam refsel_db/refsel.fasta --outBamAll data/02_align_to_ref/HSV1_S1.bam --outBamFiltered data/02_align_to_ref/HSV1_S1.mapped.bam --aligner novoalign --aligner_options '-r Random -l 20 -g 40 -x 20 -t 100 -k' --threads 120
    163. bin/read_utils.py align_and_fix data/01_per_sample/HSV-Klinik_S2.cleaned.bam refsel_db/refsel.fasta --outBamAll data/02_align_to_ref/HSV-Klinik_S2.bam --outBamFiltered data/02_align_to_ref/HSV-Klinik_S2.mapped.bam --aligner novoalign --aligner_options '-r Random -l 20 -g 40 -x 20 -t 100 -k' --threads 120
    164. bin/intrahost.py vphaser_one_sample data/02_align_to_ref/HSV1_S1.mapped.bam refsel_db/refsel.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2_on_ref.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10
    165. bin/intrahost.py vphaser_one_sample data/02_align_to_ref/HSV-Klinik_S2.mapped.bam refsel_db/refsel.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2_on_ref.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10
    166. /home/jhuang/miniconda3/envs/viral-ngs4/bin/vphaser2 -w 10000 -i data/02_align_to_ref/HSV-Klinik_S2.mapped.bam -o /home/jhuang/DATA/Data_Nicole_CaptureProbeSequencing/temp
    167. mkdir 02_align_to_NC_001806
    168. bin/read_utils.py align_and_fix data/01_per_sample/HSV1_S1.cleaned.bam refsel_db/NC_001806.2.fasta --outBamAll data/02_align_to_NC_001806/HSV1_S1.bam --outBamFiltered data/02_align_to_NC_001806/HSV1_S1.mapped.bam --aligner novoalign --aligner_options '-r Random -l 20 -g 40 -x 20 -t 100 -k' --threads 120
    169. bin/read_utils.py align_and_fix data/01_per_sample/HSV-Klinik_S2.cleaned.bam refsel_db/NC_001806.2.fasta --outBamAll data/02_align_to_NC_001806/HSV-Klinik_S2.bam --outBamFiltered data/02_align_to_NC_001806/HSV-Klinik_S2.mapped.bam --aligner novoalign --aligner_options '-r Random -l 20 -g 40 -x 20 -t 100 -k' --threads 120
    170. bin/intrahost.py vphaser_one_sample data/02_align_to_NC_001806/HSV1_S1.mapped.bam refsel_db/NC_001806.2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2_on_NC_001806.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10
    171. bin/intrahost.py vphaser_one_sample data/02_align_to_NC_001806/HSV-Klinik_S2.mapped.bam refsel_db/NC_001806.2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2_on_NC_001806.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10
    172. #align to self
    173. #-rw-rw-r-- 1 jhuang jhuang 47M Nov 8 18:24 HSV-Klinik_S2.bam
    174. #-rw-rw-r-- 1 jhuang jhuang 6,3M Nov 8 18:24 HSV-Klinik_S2.mapped.bam
    175. #-rw-rw-r-- 1 jhuang jhuang 74M Nov 8 17:25 HSV1_S1.bam
    176. #-rw-rw-r-- 1 jhuang jhuang 25K Nov 8 17:25 HSV1_S1.mapped.bam
    177. #align to NC_001806
    178. #-rw-rw-r-- 1 jhuang jhuang 48M Nov 11 13:26 HSV-Klinik_S2.bam
    179. #-rw-rw-r-- 1 jhuang jhuang 4,9M Nov 11 13:26 HSV-Klinik_S2.mapped.bam
    180. #-rw-rw-r-- 1 jhuang jhuang 74M Nov 11 13:31 HSV1_S1.bam
    181. #-rw-rw-r-- 1 jhuang jhuang 34K Nov 11 13:31 HSV1_S1.mapped.bam
    182. #align to OP297860
    183. #-rw-rw-r-- 1 jhuang jhuang 47M Nov 12 12:35 HSV-Klinik_S2.bam
    184. #-rw-rw-r-- 1 jhuang jhuang 5,3M Nov 12 12:35 HSV-Klinik_S2.mapped.bam
    185. #-rw-rw-r-- 1 jhuang jhuang 74M Nov 12 12:31 HSV1_S1.bam
    186. #-rw-rw-r-- 1 jhuang jhuang 34K Nov 12 12:31 HSV1_S1.mapped.bam
    187. #align to self
    188. #-rw-rw-r-- 1 jhuang jhuang 47M Nov 11 21:44 HSV-Klinik_S2.bam
    189. #-rw-rw-r-- 1 jhuang jhuang 6,3M Nov 11 21:44 HSV-Klinik_S2.mapped.bam
    190. #-rw-rw-r-- 1 jhuang jhuang 74M Nov 11 21:09 HSV1_S1.bam
    191. #-rw-rw-r-- 1 jhuang jhuang 25K Nov 11 21:09 HSV1_S1.mapped.bam
    192. # -- DEBUG_13 --
    193. [Mon Nov 11 15:36:54 2024]
    194. rule isnvs_vcf:
    195. input: data/04_intrahost/vphaser2.HSV1_S1.txt.gz, data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz, data/03_multialign_to_ref/aligned_1.fasta, ref_genome/reference.fasta
    196. output: data/04_intrahost/isnvs.vcf.gz, data/04_intrahost/isnvs.vcf.gz.tbi, data/04_intrahost/isnvs.annot.vcf.gz, data/04_intrahost/isnvs.annot.txt.gz, data/04_intrahost/isnvs.annot.vcf.gz.tbi
    197. jobid: 21
    198. resources: tmpdir=/tmp, mem=4
    199. b'/home/jhuang/miniconda3/envs/viral-ngs4/bin/python\n'
    200. -------
    201. bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV1_S1 HSV-Klinik_S2 --isnvs data/04_intrahost/vphaser2.HSV1_S1.txt.gz data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession
    202. b'/home/jhuang/miniconda3/envs/viral-ngs4/bin/python\n'
    203. -------
    204. 2024-11-11 15:36:55,581 - cmd:193:main_argparse - INFO - software version: 1522433800, python version: 3.6.7 | packaged by conda-forge | (default, Feb 28 2019, 09:07:38)
    205. [GCC 7.3.0]
    206. 2024-11-11 15:36:55,581 - cmd:195:main_argparse - INFO - command: bin/intrahost.py merge_to_vcf refFasta=ref_genome/reference.fasta outVcf=data/04_intrahost/isnvs.vcf.gz samples=['HSV1_S1', 'HSV-Klinik_S2'] isnvs=['data/04_intrahost/vphaser2.HSV1_S1.txt.gz', 'data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz'] alignments=['data/03_multialign_to_ref/aligned_1.fasta'] strip_chr_version=True naive_filter=False parse_accession=True loglevel=INFO
    207. 2024-11-11 15:36:55,581 - intrahost:476:merge_to_vcf - INFO - loaded CoordMapper for all genomes, starting VCF merge...
    208. Traceback (most recent call last):
    209. File "bin/intrahost.py", line 1152, in <module>
    210. util.cmd.main_argparse(__commands__, __doc__)
    211. File "/home/jhuang/Tools/viral-ngs/bin/util/cmd.py", line 221, in main_argparse
    212. ret = args.func_main(args)
    213. File "/home/jhuang/Tools/viral-ngs/bin/util/cmd.py", line 102, in _main
    214. mainfunc(**args2)
    215. File "bin/intrahost.py", line 677, in merge_to_vcf
    216. (sample, (s_pos, samp_offsets[sample]), ref_sequence.id, pos))
    217. NotImplementedError: Sample HSV-Klinik_S2-1 has variants at 2 positions (8704, 8703) mapped to same reference position (AB291960.1:63)
    218. [Mon Nov 11 15:36:56 2024]
    219. Error in rule isnvs_vcf:
    220. jobid: 0
    221. output: data/04_intrahost/isnvs.vcf.gz, data/04_intrahost/isnvs.vcf.gz.tbi, data/04_intrahost/isnvs.annot.vcf.gz, data/04_intrahost/isnvs.annot.txt.gz, data/04_intrahost/isnvs.annot.vcf.gz.tbi
    222. RuleException:
    223. CalledProcessError in line 61 of /mnt/md1/DATA_md1/Data_Nicole_CaptureProbeSequencing/bin/pipes/rules/intrahost.rules:
    224. Command 'set -euo pipefail; bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV1_S1 HSV-Klinik_S2 --isnvs data/04_intrahost/vphaser2.HSV1_S1.txt.gz data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession' returned non-zero exit status 1.
    225. File "/mnt/md1/DATA_md1/Data_Nicole_CaptureProbeSequencing/bin/pipes/rules/intrahost.rules", line 61, in __rule_isnvs_vcf
    226. File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    227. Exiting because a job execution failed. Look above for error message
    228. Shutting down, this might take some time.
    229. Exiting because a job execution failed. Look above for error message
    230. Complete log: /mnt/md1/DATA_md1/Data_Nicole_CaptureProbeSequencing/.snakemake/log/2024-11-11T151925.063825.snakemake.log
    231. # --DEBUG_14 --
    232. bin/interhost.py multichr_mafft ref_genome/reference.fasta data/02_assembly/HSV-Klinik_S2.fasta data/03_multialign_to_ref --ep 0.123 --maxiters 1000 --preservecase --localpair --outFilePrefix aligned --sampleNameListFile data/03_multialign_to_ref/sampleNameList.txt --threads 120
    233. bin/read_utils.py bwamem_idxstats inBam=data/01_cleaned/HSV-Klinik_S2.cleaned.bam refFasta=/home/jhuang/REFs/viral_ngs_dbs/spikeins/ercc_spike-ins.fasta outBam=None outStats=reports/spike_count/HSV-Klinik_S2.spike_count.txt min_score_to_filter=60 aligner_options=None loglevel=INFO tmp_dir=/tmp tmp_dirKeep=False loglevel=DEBUG
  14. Assembly results (look what are difference of the four versions 15K vs 73K in ~/DATA/Data_Nicole_CaptureProbeSequencing/tmp/02_assembly)

    1. HSV1_S1.assembly2-gapfilled.fasta vs HSV-Klinik_S2.assembly2-gapfilled.fasta
    2. -rw-rw-r-- 1 jhuang jhuang 15K Nov 8 17:12 HSV1_S1.assembly1-spades.fasta
    3. -rw-rw-r-- 1 jhuang jhuang 155K Nov 8 17:12 HSV1_S1.assembly2-scaffold_ref.fasta
    4. -rw-rw-r-- 1 jhuang jhuang 130K Nov 8 17:12 HSV1_S1.assembly2-scaffolded.fasta
    5. -rw-rw-r-- 1 jhuang jhuang 176 Nov 8 17:12 HSV1_S1.assembly2-alternate_sequences.fasta
    6. -rw-rw-r-- 1 jhuang jhuang 130K Nov 8 17:14 HSV1_S1.assembly2-gapfilled.fasta
    7. -rw-rw-r-- 1 jhuang jhuang 26 Nov 8 17:18 HSV1_S1.assembly3-modify.fasta.fai
    8. -rw-rw-r-- 1 jhuang jhuang 182 Nov 8 17:18 HSV1_S1.assembly3-modify.dict
    9. -rw-r--r-- 1 jhuang jhuang 1,7M Nov 8 17:18 HSV1_S1.assembly3-modify.nix
    10. -rw-rw-r-- 1 jhuang jhuang 155K Nov 8 17:18 HSV1_S1.assembly3-modify.fasta
    11. -rw-rw-r-- 1 jhuang jhuang 212 Nov 8 17:21 HSV1_S1.assembly3.vcf.gz.tbi
    12. -rw-rw-r-- 1 jhuang jhuang 183 Nov 8 17:21 HSV1_S1.assembly4-refined.dict
    13. -rw-rw-r-- 1 jhuang jhuang 26 Nov 8 17:21 HSV1_S1.assembly4-refined.fasta.fai
    14. -rw-r--r-- 1 jhuang jhuang 1,2M Nov 8 17:21 HSV1_S1.assembly4-refined.nix
    15. -rw-rw-r-- 1 jhuang jhuang 137K Nov 8 17:21 HSV1_S1.assembly4-refined.fasta
    16. -rw-rw-r-- 1 jhuang jhuang 494K Nov 8 17:21 HSV1_S1.assembly3.vcf.gz
    17. -rw-rw-r-- 1 jhuang jhuang 203 Nov 8 17:22 HSV1_S1.assembly4.vcf.gz.tbi
    18. -rw-rw-r-- 1 jhuang jhuang 428K Nov 8 17:22 HSV1_S1.assembly4.vcf.gz
    19. -rw-rw-r-- 1 jhuang jhuang 73K Nov 8 18:03 HSV-Klinik_S2.assembly1-spades.fasta
    20. -rw-rw-r-- 1 jhuang jhuang 144K Nov 8 18:03 HSV-Klinik_S2.assembly2-scaffolded.fasta
    21. -rw-rw-r-- 1 jhuang jhuang 0 Nov 8 18:03 HSV-Klinik_S2.assembly2-alternate_sequences.fasta
    22. -rw-rw-r-- 1 jhuang jhuang 155K Nov 8 18:03 HSV-Klinik_S2.assembly2-scaffold_ref.fasta
    23. -rw-rw-r-- 1 jhuang jhuang 144K Nov 8 18:07 HSV-Klinik_S2.assembly2-gapfilled.fasta
    24. -rw-rw-r-- 1 jhuang jhuang 32 Nov 8 18:12 HSV-Klinik_S2.assembly3-modify.fasta.fai
    25. -rw-rw-r-- 1 jhuang jhuang 194 Nov 8 18:12 HSV-Klinik_S2.assembly3-modify.dict
    26. -rw-r--r-- 1 jhuang jhuang 1,7M Nov 8 18:12 HSV-Klinik_S2.assembly3-modify.nix
    27. -rw-rw-r-- 1 jhuang jhuang 155K Nov 8 18:12 HSV-Klinik_S2.assembly3-modify.fasta
  15. draw coverages

    1. * Mapping the contig on the reference JX878414
    2. bowtie2-build refsel_db/refsel.fasta refsel_index
    3. #spades/contigs.fasta
    4. #bowtie2 -f -x refsel_index -U HSV1_S1_vrap_out/HSV1_S1_contigs.fasta -N 1 --score-min L,0,-1 --rdg 5,3 --rfg 5,3 -S HSV1_S1_contigs_aligned.sam
    5. bowtie2 -f -x refsel_index -U HSV1_S1_vrap_out/HSV1_S1_contigs.fasta -S HSV1_S1_contigs_aligned.sam
    6. samtools view -bS -F 4 HSV1_S1_contigs_aligned.sam > HSV1_S1_contigs_aligned.bam
    7. #samtools view -S -b HSV1_S1_contigs_aligned.sam > HSV1_S1_contigs_aligned.bam
    8. samtools sort HSV1_S1_contigs_aligned.bam -o HSV1_S1_contigs_aligned_sorted.bam
    9. samtools index HSV1_S1_contigs_aligned_sorted.bam
    10. samtools view HSV1_S1_contigs_aligned_sorted.bam > HSV1_S1_contigs_aligned_sorted.sam
    11. Query sequence name Query length ORF density Percentage identity Subject sequence length Subject accession Subject name E-value
    12. #TODO: Analyis in next time consider keep the column of query_coverage for quality control?
    13. #2486 reads; of these:
    14. # 2486 (100.00%) were unpaired; of these:
    15. # 2407 (96.82%) aligned 0 times
    16. # 79 (3.18%) aligned exactly 1 time
    17. # 0 (0.00%) aligned >1 times
    18. #3.18% overall alignment rate
    19. 11 reads; of these:
    20. 11 (100.00%) were unpaired; of these:
    21. 8 (72.73%) aligned 0 times
    22. 3 (27.27%) aligned exactly 1 time
    23. 0 (0.00%) aligned >1 times
    24. 27.27% overall alignment rate
    25. NODE_14_length_862_cov_192.742857
    26. NODE_19_length_621_cov_61.380567
    27. CAP_16_length_559
    28. gi|946552631|gb|KT425109.1| Human alphaherpesvirus 1 strain KOS79
    29. gi|2549839763|gb|OQ724891.1| Human alphaherpesvirus 1 strain BP-K5
    30. gi|2228071600|gb|ON007132.1| Human alphaherpesvirus 1 strain v40_unk_gen
    31. samtools faidx HSV1_S1_contigs.fasta 'NODE_14_length_862_cov_192.742857' > HSV1_S1_contigs_.fasta
    32. samtools faidx HSV1_S1_contigs.fasta 'NODE_19_length_621_cov_61.380567' >> HSV1_S1_contigs_.fasta
    33. samtools faidx HSV1_S1_contigs.fasta 'CAP_16_length_559' >> HSV1_S1_contigs_.fasta
    34. bowtie2 -f -x refsel_index -U HSV-Klinik_S2_vrap_out/HSV-Klinik_S2_contigs.fasta -S HSV-Klinik_S2_contigs_aligned.sam
    35. samtools view -bS -F 4 HSV-Klinik_S2_contigs_aligned.sam > HSV-Klinik_S2_contigs_aligned.bam
    36. #samtools view -S -b HSV-Klinik_S2_contigs_aligned.sam > HSV-Klinik_S2_contigs_aligned.bam
    37. samtools sort HSV-Klinik_S2_contigs_aligned.bam -o HSV-Klinik_S2_contigs_aligned_sorted.bam
    38. samtools index HSV-Klinik_S2_contigs_aligned_sorted.bam
    39. samtools view HSV-Klinik_S2_contigs_aligned_sorted.bam > HSV-Klinik_S2_contigs_aligned_sorted.sam
    40. 31 reads; of these:
    41. 31 (100.00%) were unpaired; of these:
    42. 8 (25.81%) aligned 0 times
    43. 21 (67.74%) aligned exactly 1 time
    44. 2 (6.45%) aligned >1 times
    45. 74.19% overall alignment rate
    46. samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_14_length_2544_cov_467.428217' > HSV-Klinik_S2_contigs_.fasta
    47. samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_81_length_1225_cov_1080.820583' >> HSV-Klinik_S2_contigs_.fasta
    48. samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_114_length_1046_cov_1018.474429' >> HSV-Klinik_S2_contigs_.fasta
    49. samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_117_length_1033_cov_1618.421858' >> HSV-Klinik_S2_contigs_.fasta
    50. samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_152_length_927_cov_105.347500' >> HSV-Klinik_S2_contigs_.fasta
    51. samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_161_length_900_cov_3.283312' >> HSV-Klinik_S2_contigs_.fasta
    52. samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_220_length_795_cov_0.748879' >> HSV-Klinik_S2_contigs_.fasta
    53. samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_245_length_763_cov_900.518868' >> HSV-Klinik_S2_contigs_.fasta
    54. samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_352_length_664_cov_61.363128' >> HSV-Klinik_S2_contigs_.fasta
    55. samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_368_length_644_cov_489.846591' >> HSV-Klinik_S2_contigs_.fasta
    56. samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_373_length_653_cov_0.340304' >> HSV-Klinik_S2_contigs_.fasta
    57. samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_414_length_634_cov_2501.944773' >> HSV-Klinik_S2_contigs_.fasta
    58. samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_626_length_568_cov_1.630385' >> HSV-Klinik_S2_contigs_.fasta
    59. samtools faidx HSV-Klinik_S2_contigs.fasta 'NODE_1026_length_506_cov_2.593668' >> HSV-Klinik_S2_contigs_.fasta
    60. samtools faidx HSV-Klinik_S2_contigs.fasta 'CAP_7_length_1389' >> HSV-Klinik_S2_contigs_.fasta
    61. samtools faidx HSV-Klinik_S2_contigs.fasta 'CAP_8_length_1267' >> HSV-Klinik_S2_contigs_.fasta
    62. samtools faidx HSV-Klinik_S2_contigs.fasta 'CAP_9_length_1581' >> HSV-Klinik_S2_contigs_.fasta
    63. samtools faidx HSV-Klinik_S2_contigs.fasta 'CAP_18_length_896' >> HSV-Klinik_S2_contigs_.fasta
    64. samtools faidx HSV-Klinik_S2_contigs.fasta 'CAP_25_length_841' >> HSV-Klinik_S2_contigs_.fasta
    65. samtools faidx HSV-Klinik_S2_contigs.fasta 'CAP_52_length_1849' >> HSV-Klinik_S2_contigs_.fasta
    66. samtools faidx HSV-Klinik_S2_contigs.fasta 'CAP_53_length_665' >> HSV-Klinik_S2_contigs_.fasta
    67. samtools faidx HSV-Klinik_S2_contigs.fasta 'CAP_54_length_820' >> HSV-Klinik_S2_contigs_.fasta
    68. samtools faidx HSV-Klinik_S2_contigs.fasta 'CAP_56_length_1189' >> HSV-Klinik_S2_contigs_.fasta
    69. gi|1059802459|gb|LT594105.1|
    70. gi|2315197778|gb|OP297860.1|
    71. gi|2549840487|gb|OQ724911.1|
    72. gi|1059802767|gb|LT594109.1|
    73. gi|2620238293|gb|OR771685.1|
    74. gi|2620238293|gb|OR771685.1|
    75. gi|2315199769|gb|OP297886.1|
    76. gi|2549841171|gb|OQ724933.1|
    77. gi|2620238293|gb|OR771685.1|
    78. gi|1809626902|gb|MN925871.1|
    79. gi|2618798953|gb|OR723971.1|
    80. gi|2315197778|gb|OP297860.1|
    81. gi|2277963097|gb|ON960059.1|
    82. gi|2620238293|gb|OR771685.1|
    83. gi|2549599151|gb|OQ724836.1|
    84. gi|1717903527|gb|MN136523.1|
    85. gi|1059802459|gb|LT594105.1|
    86. gi|2549841171|gb|OQ724933.1|
    87. gi|2315197778|gb|OP297860.1|
    88. gi|2620238293|gb|OR771685.1|
    89. gi|2315197778|gb|OP297860.1|
    90. gi|1809626902|gb|MN925871.1|
    91. Human herpesvirus 1 isolate 172_2010 genome assembly
    92. Human alphaherpesvirus 1 strain HSV1-v60_d3_cu_gen_les
    93. Human alphaherpesvirus 1 strain BP-K12
    94. Human herpesvirus 1 isolate 270_2007 genome assembly
    95. Human alphaherpesvirus 1 isolate HSV1/USA/WA-UW-2L9/2020
    96. Human alphaherpesvirus 1 isolate HSV1/USA/WA-UW-2L9/2020
    97. Human alphaherpesvirus 1 strain HSV1-v72_d53_cu_gen_les
    98. Human alphaherpesvirus 1 strain BP-L2
    99. Human alphaherpesvirus 1 isolate HSV1/USA/WA-UW-2L9/2020
    100. UNVERIFIED: Human alphaherpesvirus 1 strain Sample4_DOCK8
    101. Mutant Human alphaherpesvirus 1 isolate dsncRNA12
    102. Human alphaherpesvirus 1 strain HSV1-v60_d3_cu_gen_les
    103. Human alphaherpesvirus 1 strain HSV1-San-Francisco-USA-1974-HTZ
    104. Human alphaherpesvirus 1 isolate HSV1/USA/WA-UW-2L9/2020
    105. UNVERIFIED: Human alphaherpesvirus 1 strain BP-C8
    106. Human alphaherpesvirus 1 strain MacIntyre
    107. Human herpesvirus 1 isolate 172_2010 genome assembly
    108. Human alphaherpesvirus 1 strain BP-L2
    109. Human alphaherpesvirus 1 strain HSV1-v60_d3_cu_gen_les
    110. Human alphaherpesvirus 1 isolate HSV1/USA/WA-UW-2L9/2020
    111. Human alphaherpesvirus 1 strain HSV1-v60_d3_cu_gen_les
    112. UNVERIFIED: Human alphaherpesvirus 1 strain Sample4_DOCK8
    113. Human alphaherpesvirus 1 strain HSV1-v67_d346_cu_gen_les
    114. #-->OR771685.1
    115. #8278 reads; of these:
    116. # 8278 (100.00%) were unpaired; of these:
    117. # 3775 (45.60%) aligned 0 times
    118. # 4500 (54.36%) aligned exactly 1 time
    119. # 3 (0.04%) aligned >1 times
    120. #54.40% overall alignment rate
    121. * Generate Coverage Profile for Reads (from Fastq): Align the trimmed fastq reads to the reference genome using a mapper like BWA or Bowtie2 (WRONG), we should use novoalign
    122. #bwa index refsel_db/refsel.fasta
    123. #bwa mem refsel_db/refsel.fasta trimmed/HSV1_S1_R1.fastq.gz trimmed/HSV1_S1_R2.fastq.gz > HSV1_S1_reads_aligned.sam
    124. #samtools view -Sb HSV1_S1_reads_aligned.sam | samtools sort -o HSV1_S1_reads_aligned_sorted.bam
    125. #samtools index HSV1_S1_reads_aligned_sorted.bam
    126. #bwa mem refsel_db/refsel.fasta trimmed/HSV-Klinik_S2_R1.fastq.gz trimmed/HSV-Klinik_S2_R2.fastq.gz > HSV-Klinik_S2_reads_aligned.sam
    127. #samtools view -Sb HSV-Klinik_S2_reads_aligned.sam | samtools sort -o HSV-Klinik_S2_reads_aligned_sorted.bam
    128. #samtools index HSV-Klinik_S2_reads_aligned_sorted.bam
    129. cd data
    130. mkdir 02_align_to_OP297860
    131. ../bin/read_utils.py align_and_fix 01_per_sample/HSV1_S1.cleaned.bam ../refsel_db/refsel.fasta --outBamAll 02_align_to_OP297860/HSV1_S1.bam --outBamFiltered 02_align_to_OP297860/HSV1_S1.mapped.bam --aligner novoalign --aligner_options '-r Random -l 20 -g 40 -x 20 -t 100 -k' --threads 120
    132. ../bin/read_utils.py align_and_fix 01_per_sample/HSV-Klinik_S2.cleaned.bam ../refsel_db/refsel.fasta --outBamAll 02_align_to_OP297860/HSV-Klinik_S2.bam --outBamFiltered 02_align_to_OP297860/HSV-Klinik_S2.mapped.bam --aligner novoalign --aligner_options '-r Random -l 20 -g 40 -x 20 -t 100 -k' --threads 120
    133. samtools sort 02_align_to_OP297860/HSV1_S1.mapped.bam -o HSV1_S1_reads_aligned_sorted.bam
    134. samtools index HSV1_S1_reads_aligned_sorted.bam
    135. samtools sort 02_align_to_OP297860/HSV-Klinik_S2.mapped.bam -o HSV-Klinik_S2_reads_aligned_sorted.bam
    136. samtools index HSV-Klinik_S2_reads_aligned_sorted.bam
    137. mv 02_align_to_OP297860/*.bam ..
    138. rmdir 02_align_to_OP297860
    139. * Generate Coverage Tracks: Use BamCoverage to generate coverage files (in bigWig format) for both the reads and contigs.
    140. #find . -name "*_aligned_sorted.bam"
    141. bamCoverage -b ./HSV1_S1_reads_aligned_sorted.bam -o HSV1_S1_reads_coverage.bw
    142. bamCoverage -b ./HSV1_S1_contigs_aligned_sorted.bam -o HSV1_S1_contigs_coverage.bw
    143. bamCoverage -b ./HSV-Klinik_S2_reads_aligned_sorted.bam -o HSV-Klinik_S2_reads_coverage.bw
    144. bamCoverage -b ./HSV-Klinik_S2_contigs_aligned_sorted.bam -o HSV-Klinik_S2_contigs_coverage.bw
    145. * Visualize Alignments: Use tools like IGV (Integrative Genomics Viewer)
  16. Reproduce 03_multialign_to_ref by generating consensus fasta

    1. #bedtools bamtobed -i HSV-Klinik_S2_contigs_aligned_sorted.bam > contigs.bed
    2. bedtools bamtobed -i HSV1_S1_vrap_out_v5/bowtie/mapped_sorted.bam > contigs.bed
    3. bedtools merge -i contigs.bed > merged_contigs_coverage.bed
    4. awk '{sum += $3 - $2} END {print sum}' merged_contigs_coverage.bed
    5. #20916
    6. #generate alignment form contigs.bam and refsel.fasta
    7. bcftools mpileup -f refsel_db/refsel.fasta -d 1000000 HSV-Klinik_S2_contigs_aligned_sorted.bam | bcftools call -mv --ploidy 1 -Ov -o contigs_variants.vcf
    8. bgzip contigs_variants.vcf
    9. tabix -p vcf contigs_variants.vcf.gz
    10. cat refsel_db/refsel.fasta | bcftools consensus contigs_variants.vcf.gz > aligned_contigs_to_reference.fasta
    11. # tabix -p vcf contigs_variants.vcf.gz
    12. # cat refsel_db/refsel.fasta | bcftools consensus contigs_variants.vcf.gz > aligned_contigs_to_reference.fasta
    13. #Note: the --sample option not given, applying all records regardless of the genotype
    14. #Applied 30 variants
    15. cat refsel_db/refsel.fasta aligned_contigs_to_reference.fasta > aligned_1.fasta
    16. #Header of the 2nd record is >HSV-Klinik_S2-1
    17. mafft aligned_1.fasta | sed '/^>/! s/[a-z]/\U&/g' > data/03_multialign_to_ref/aligned_1.fasta
  17. Reproduce 04_intrahost, #DEBUG_IMPORTANT_NOT_SAME_BETWEEN_VPHASER2_AND_FREEBAYES: why not intrahost variant calling not having the frequencies between 0.2 and 0.8. The list is also total different to the results from freebayes. try different combination of ""--removeDoublyMappedReads --minReadsEach 5 --maxBias 0"

    1. awk '$6 >= 0.05' isnvs.annot.txt > isnvs.annot_.txt
    2. chr pos sample patient time alleles iSNV_freq Hw Hs eff_type eff_codon_dna eff_aa eff_aa_pos eff_prot_len eff_gene eff_protein
    3. * OP297860 13203 HSV-Klinik_S2 HSV-Klinik_S2 T,C,A 1 0.0165025249227804 1 synonymous_variant,intragenic_variant 1614A>G,1614A>T,n.13203T>C,n.13203T>A Val538Val 538 882 UL5 UXY89136.1,Gene_11440_14815
    4. * OP297860 47109 HSV-Klinik_S2 HSV-Klinik_S2 T,G 0.992975413948821 0.0139504824839776 1 missense_variant 1126A>C Asn376His 376 376 UL23 UXY89153.1
    5. OP297860 47989 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.0537303216974675 0.101686748455508 1 synonymous_variant 246C>A Pro82Pro 82 376 UL23 UXY89153.1
    6. OP297860 55501 HSV-Klinik_S2 HSV-Klinik_S2 T,C,A 1 0.0204843614284831 1 synonymous_variant,intragenic_variant 720A>G,720A>T,n.55501T>C,n.55501T>A Ala240Ala 240 904 UL27,UL28 UXY89158.1,Gene_53483_58584
    7. OP297860 55807 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.0622837370242215 0.116808946253038 1 missense_variant,intragenic_variant 414G>T,n.55807C>A Glu138Asp 138 904 UL27,UL28 UXY89158.1,Gene_53483_58584
    8. * OP297860 65225 HSV-Klinik_S2 HSV-Klinik_S2 G,A 0.891530460624071 0.193407796807005 1 intragenic_variant n.65225G>A UL30 Gene_63070_67475
    9. * OP297860 65402 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.102222222222222 0.183545679012346 1 intragenic_variant n.65402C>A UL30 Gene_63070_67475
    10. OP297860 66570 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.0518433179723502 0.0983111767079359 1 intragenic_variant n.66570G>T UL30 Gene_63070_67475
    11. OP297860 94750 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.0528511821974965 0.100115869475647 1 missense_variant 108G>T Gln36His 36 488 UL42 UXY89171.1
    12. samtools faidx aligned_1.fasta "OP297860.1":13203-13203 #T
    13. samtools faidx aligned_1.fasta HSV-Klinik_S2-1:13203-13203 #T
    14. samtools faidx aligned_1.fasta "OP297860.1":47109-47109 #T
    15. samtools faidx aligned_1.fasta HSV-Klinik_S2-1:47109-47109 #T
    16. samtools faidx aligned_1.fasta "OP297860.1":47989-47989 #G
    17. samtools faidx aligned_1.fasta HSV-Klinik_S2-1:47989-47989 #G
    18. samtools faidx aligned_1.fasta "OP297860.1":65225-65225 #G
    19. samtools faidx aligned_1.fasta HSV-Klinik_S2-1:65225-65225 #A
    20. #DEBUG_IMPORTANT_NOT_SAME_BETWEEN_VPHASER2_AND_FREEBAYES: why not intrahost variant calling not located in 0.6, 0.4
    21. vim bin/tools/vphaser2.py # set w=25000
    22. rm -rf data/04_intrahost
    23. snakemake --printshellcmds --cores 10
    24. samtools index data/02_align_to_self/HSV1_S1.mapped.bam
    25. samtools index data/02_align_to_self/HSV-Klinik_S2.mapped.bam
    26. bin/interhost.py multichr_mafft ref_genome/reference.fasta data/02_assembly/HSV-Klinik_S2.fasta data/03_multialign_to_ref --ep 0.123 --maxiters 1000 --preservecase --localpair --outFilePrefix aligned --sampleNameListFile data/03_multialign_to_ref/sampleNameList.txt --threads 120 -loglevel DEBUG
    27. #interhost variant calling, the number below should be not the same to the intrahost variant calling (the varaints from the isolate to its consensus assemby, this is why the frequency theoretically under 0.5. In intrahost variant calling, the REF refers to the base OP297860.1. It is possible that a ALT has 90% in the clinical samples --> All positions with > 0.5 means the consensus sequences are different to the CHROM. The frequences varies 0.00000001 to 1.0, since if the frequences with 0.0 will be not reported.)
    28. #The contigs contains a lot of positions wrongly assembled, so it is actually only much fewer following positions are interhost variants.
    29. samtools index HSV1_S1.mapped.bam
    30. samtools index HSV-Klinik_S2.mapped.bam
    31. bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV1_S1.mapped.bam data/02_assembly/HSV1_S1.fasta data/04_intrahost/vphaser2.HSV1_S1.txt.gz --vphaserNumThreads 120 --minReadsEach 5 --maxBias 0 --loglevel DEBUG
    32. awk '$7 >= 5' vphaser2.HSV-Klinik_S2_v2.txt > vphaser2.HSV-Klinik_S2_v2_.txt
    33. bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV-Klinik_S2.mapped.bam data/02_assembly/HSV-Klinik_S2.fasta data/04_intrahost/vphaser2.HSV1_S1.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 0 --loglevel DEBUG
  18. Manully complete the assemblies with the reference genome and recreated 02_assembly, then rerun the pipelines for the steps after 02_align_to_self

    1. ~/Scripts/convert_fasta_to_clustal.py aligned_1.fasta_orig aligned_1.aln
    2. ~/Scripts/convert_clustal_to_clustal.py aligned_1.aln aligned_1_.aln
    3. #manully delete the postion with all or '-' in aligned_1_.aln
    4. ~/Scripts/check_sequence_differences.py aligned_1_.aln
    5. #Differences found at the following positions (150):
    6. Position 8956: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    7. Position 8991: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
    8. Position 8992: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
    9. Position 8995: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    10. Position 9190: OP297860.1 = T, HSV1_S1-1 = A, HSV-Klinik_S2-1 = T
    11. Position 9294: OP297860.1 = A, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    12. Position 9298: OP297860.1 = A, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    13. Position 9319: OP297860.1 = A, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    14. Position 9324: OP297860.1 = T, HSV1_S1-1 = G, HSV-Klinik_S2-1 = T
    15. Position 9352: OP297860.1 = C, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    16. Position 9368: OP297860.1 = A, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    17. Position 10036: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = T
    18. Position 12006: OP297860.1 = C, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    19. Position 12131: OP297860.1 = C, HSV1_S1-1 = M, HSV-Klinik_S2-1 = C
    20. Position 12748: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
    21. Position 12753: OP297860.1 = A, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    22. * Position 13203: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
    23. * Position 13522: OP297860.1 = G, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
    24. Position 13557: OP297860.1 = A, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    25. Position 13637: OP297860.1 = A, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    26. * Position 13659: OP297860.1 = G, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
    27. Position 13731: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
    28. Position 13755: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    29. Position 13778: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
    30. Position 14835: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    31. Position 34549: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
    32. Position 34705: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    33. Position 41118: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    34. Position 41422: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    35. Position 44110: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
    36. Position 44137: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    37. Position 44190: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    38. Position 44227: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = G
    39. Position 44295: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
    40. Position 46861: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    41. # Position 47109: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
    42. Position 47170: OP297860.1 = G, HSV1_S1-1 = T, HSV-Klinik_S2-1 = T
    43. Position 47182: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    44. Position 47320: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    45. Position 47375: OP297860.1 = G, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    46. Position 47377: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = T
    47. Position 47393: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
    48. Position 47433: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = T
    49. Position 47436: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
    50. Position 47484: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
    51. Position 47516: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
    52. Position 47563: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    53. Position 47660: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    54. Position 47707: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    55. Position 47722: OP297860.1 = A, HSV1_S1-1 = G, HSV-Klinik_S2-1 = G
    56. * Position 47969: OP297860.1 = C, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    57. Position 48064: OP297860.1 = G, HSV1_S1-1 = A, HSV-Klinik_S2-1 = A
    58. Position 48113: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
    59. Position 48129: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
    60. Position 48167: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
    61. Position 48219: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
    62. Position 48255: OP297860.1 = G, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
    63. Position 48384: OP297860.1 = C, HSV1_S1-1 = G, HSV-Klinik_S2-1 = C
    64. Position 53216: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = T
    65. Position 53254: OP297860.1 = C, HSV1_S1-1 = G, HSV-Klinik_S2-1 = C
    66. Position 53265: OP297860.1 = G, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
    67. Position 53291: OP297860.1 = C, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    68. Position 53298: OP297860.1 = A, HSV1_S1-1 = G, HSV-Klinik_S2-1 = G
    69. Position 53403: OP297860.1 = C, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    70. Position 53423: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
    71. Position 53445: OP297860.1 = C, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
    72. Position 53450: OP297860.1 = C, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
    73. Position 53460: OP297860.1 = A, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    74. Position 53659: OP297860.1 = A, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
    75. * Position 53691: OP297860.1 = G, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
    76. Position 54007: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    77. Position 54013: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    78. Position 54025: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    79. Position 54073: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    80. Position 54408: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
    81. Position 54568: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = T
    82. Position 54708: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
    83. Position 54709: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    84. * Position 55501: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
    85. Position 55507: OP297860.1 = G, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    86. Position 55543: OP297860.1 = G, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    87. Position 56493: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    88. Position 56753: OP297860.1 = G, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
    89. Position 56981: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    90. Position 58075: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    91. Position 58078: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
    92. Position 58526: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
    93. Position 58550: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
    94. Position 58604: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
    95. Position 58615: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
    96. Position 58789: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
    97. * Position 63248: OP297860.1 = G, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
    98. Position 63799: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
    99. * Position 64328: OP297860.1 = C, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
    100. Position 65179: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    101. * Position 65225: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    102. Position 65992: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
    103. Position 66677: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    104. Position 67336: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    105. Position 87848: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    106. Position 87866: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = G
    107. Position 87942: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = G
    108. Position 87949: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    109. * Position 95302: OP297860.1 = C, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
    110. Position 95320: OP297860.1 = G, HSV1_S1-1 = K, HSV-Klinik_S2-1 = G
    111. Position 95992: OP297860.1 = G, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    112. Position 96124: OP297860.1 = G, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    113. Position 96138: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
    114. Position 96145: OP297860.1 = C, HSV1_S1-1 = A, HSV-Klinik_S2-1 = C
    115. Position 100159: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    116. Position 107885: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = G
    117. Position 114972: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = T
    118. Position 117663: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    119. Position 117802: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = A
    120. Position 117834: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
    121. Position 117841: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    122. Position 118616: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
    123. Position 119486: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    124. Position 119519: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    125. Position 120688: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    126. Position 120690: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = T
    127. Position 120711: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    128. Position 120714: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    129. Position 133842: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    130. Position 133894: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = T
    131. Position 134778: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
    132. Position 134788: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
    133. Position 134867: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
    134. Position 134895: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    135. Position 134898: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    136. Position 134942: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = G
    137. Position 136436: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
    138. Position 136900: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = A
    139. Position 137047: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    140. Position 137155: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = G
    141. Position 137527: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = T
    142. Position 137569: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
    143. Position 137602: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
    144. Position 137944: OP297860.1 = T, HSV1_S1-1 = A, HSV-Klinik_S2-1 = T
    145. Position 138170: OP297860.1 = T, HSV1_S1-1 = C, HSV-Klinik_S2-1 = C
    146. Position 138343: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
    147. Position 138880: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    148. Position 139104: OP297860.1 = T, HSV1_S1-1 = T, HSV-Klinik_S2-1 = C
    149. Position 140457: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = M
    150. Position 141865: OP297860.1 = A, HSV1_S1-1 = A, HSV-Klinik_S2-1 = G
    151. Position 141889: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = A
    152. Position 141937: OP297860.1 = G, HSV1_S1-1 = G, HSV-Klinik_S2-1 = C
    153. Position 142056: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = G
    154. Position 144444: OP297860.1 = C, HSV1_S1-1 = C, HSV-Klinik_S2-1 = T
    155. ~/Scripts/convert_clustal_to_fasta.py aligned_1_.aln aligned_1.fasta
    156. samtools faidx aligned_1.fasta
    157. samtools faidx aligned_1.fasta OP297860.1 > OP297860.1.fasta
    158. samtools faidx aligned_1.fasta HSV1_S1-1 > HSV1_S1-1.fasta
    159. samtools faidx aligned_1.fasta HSV-Klinik_S2-1 > HSV-Klinik_S2-1.fasta
    160. seqkit seq OP297860.1.fasta -w 70 > OP297860.1_w70.fasta
    161. diff OP297860.1_w70.fasta ../../refsel_db/refsel.fasta
    162. #< >OP297860.1
    163. #---
    164. #> >OP297860.1 Human alphaherpesvirus 1 strain HSV1-v60_d3_cu_gen_les, complete genome
    165. #2180c2180,2181
    166. #< ACGGGCCCCCCCCCGAAACACACCCCCCGGGGGTCGCGCGCGGCCCTT
    167. #---
    168. #> ACGGGCCCCCCCCCGAAACACACCCCCCGGGGGTCGCGCGCGGCCCTTTAAAAAGGCGGGGCGGGT
    169. mv 02_assembly 02_assembly_v1
    170. mv 02_align_to_self 02_align_to_self_v1
    171. mv 03_multialign_to_ref/ 03_multialign_to_ref_v1
    172. mv 04_intrahost 04_intrahost_v1
    173. mkdir 02_assembly
    174. cp 03_multialign_to_ref_v1/HSV1_S1-1.fasta 02_assembly/HSV1_S1.fasta
    175. cp 03_multialign_to_ref_v1/HSV-Klinik_S2-1.fasta 02_assembly/HSV-Klinik_S2.fasta
    176. samtools faidx HSV1_S1.fasta
    177. picard CreateSequenceDictionary R=HSV1_S1.fasta O=HSV1_S1.dict
    178. ~/Tools/novocraft_v3/novoindex HSV1_S1.nix HSV1_S1.fasta
    179. samtools faidx HSV-Klinik_S2.fasta
    180. picard CreateSequenceDictionary R=HSV-Klinik_S2.fasta O=HSV-Klinik_S2.dict
    181. ~/Tools/novocraft_v3/novoindex HSV-Klinik_S2.nix HSV-Klinik_S2.fasta
  19. If the reads in mapped.bam too few, we can manully rerun the following steps with custom defined bam, for example cleaned.bam or taxfilt.bam files (see the point 1).

    1. # -- Adjust Novoalign parameter to increase the mapped reads in 02_align_to_self --
    2. If you are working with NovoAlign for virus variant calling and find that very few reads are retained, you can adjust certain parameters to increase the read count while still maintaining high mapping quality. Here are some suggestions for tuning the parameters in NovoAlign:
    3. Reduce the Minimum Alignment Score Threshold (-t):
    4. Current Setting: -t 100
    5. Suggestion: Try reducing this threshold to around -t 90 or -t 80.
    6. Explanation: The -t parameter in NovoAlign sets the minimum alignment score, which is the threshold for accepting an alignment. Lowering this score allows more alignments to pass through, increasing read retention. Reducing it slightly can retain quality while increasing the number of mapped reads.
    7. Adjust the Gap Penalty (-g):
    8. Current Setting: -g 40
    9. Suggestion: Try using a slightly lower gap penalty, such as -g 20 or -g 30.
    10. Explanation: Lowering the gap penalty allows reads with minor gaps to align more easily, which may be beneficial for viral genomes with regions that might induce small indels. This adjustment should increase read retention without sacrificing too much mapping quality.
    11. Lower the Mismatch Penalty (-x):
    12. Current Setting: -x 20
    13. Suggestion: Try reducing this to -x 15 or -x 10.
    14. Explanation: A lower mismatch penalty allows more reads with minor mismatches to map, increasing the number of mapped reads. For viral genomes, this can be helpful because some variability is expected, especially in variant-calling workflows.
    15. Experiment with the Random Alignment Option (-r):
    16. Current Setting: -r Random
    17. Suggestion: If applicable, you might try other random alignment settings in NovoAlign or disable it temporarily to see if deterministic behavior (i.e., -r All) provides more reads without sacrificing quality.
    18. Explanation: This option controls how NovoAlign treats random alignments. Testing with or without it may affect read retention, especially if many reads align equally well to multiple sites.
    19. Increase the Soft-Clipping Parameter (-l):
    20. Current Setting: -l 20
    21. Suggestion: Try increasing to -l 30 or -l 40.
    22. Explanation: Higher soft-clipping allows NovoAlign to discard low-quality or mismatched bases at the read ends, which can improve alignment quality and retention by allowing reads that otherwise would be discarded due to terminal mismatches.
    23. Consider Using Paired-End Data:
    24. Suggestion: If you have paired-end data available, align the reads as paired rather than single-ended.
    25. Explanation: Paired-end alignment can improve mapping quality and retention since the additional read information helps resolve ambiguous mappings. It also increases the reliability of alignments by adding context from both read ends.
    26. Sample Adjusted Command
    27. Heres an example of a modified NovoAlign command incorporating the above suggestions:
    28. novoalign -d reference.nix -f reads.fq -r Random -l 40 -g 30 -x 15 -t 80 -k > output.sam
    29. Additional Steps for High-Quality Variant Calling:
    30. Use a Quality Filter Post-Alignment: After alignment, apply a quality filter on the mapped reads using a tool like Samtools to ensure that only high-confidence alignments are used for variant calling.
    31. Optimize Read Processing: Trim low-quality bases and remove adapters prior to alignment to ensure that only high-quality reads are used. This will increase both the retention rate and the quality of the alignments.
    32. Review Variant Calling Parameters: After alignment, check the variant-calling parameters to ensure they are suitable for low-complexity viral genomes and that high-quality mapping is prioritized.
    33. These adjustments should help you retain more reads while still maintaining alignment quality suitable for variant calling. Adjust these parameters iteratively, evaluating the resulting alignments and variant calls to find the best balance between read count and quality.
    34. # (TODO) look which configuration can reach the similar results as the freebayes?
    35. vim bin/tools/vphaser2.py # edit '-w 22000'
    36. mkdir 04_intrahost
    37. #[E::idx_find_and_load] Could not retrieve index file for 'data/02_align_to_self/HSV-Klinik_S2.mapped.bam'
    38. #[E::idx_find_and_load] Could not retrieve index file for 'data/02_align_to_self/HSV-Klinik_S2.mapped.bam'
    39. samtools index data/02_align_to_self/HSV1_S1.mapped.bam
    40. samtools index data/02_align_to_self/HSV-Klinik_S2.mapped.bam
    41. bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV-Klinik_S2.mapped.bam data/02_assembly/HSV-Klinik_S2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2_removeDoubly_min5_max1000000_w22000.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 1000000 --loglevel DEBUG
    42. bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV-Klinik_S2 --isnvs data/04_intrahost/vphaser2.HSV-Klinik_S2_removeDoubly_min5_max1000000_w22000.txt --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession
    43. #---- (Maybe next time, this time, it is not necessary): running once for l20_g40_x20_t100, once for l40_g30_x15_t80, which is option for novoalign in config.yaml, Note that we need rerun rerun 02_align_to_self.
    44. # -- 04_intrahost_--removeDoublyMappedReads_--minReadsEach5_--maxBias10 --
    45. mkdir data/04_intrahost
    46. bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV1_S1.mapped.bam data/02_assembly/HSV1_S1.fasta data/04_intrahost/vphaser2.HSV1_S1.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10
    47. bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV-Klinik_S2.mapped.bam data/02_assembly/HSV-Klinik_S2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10
    48. #bin/read_utils.py bwamem_idxstats data/01_cleaned/HSV-Klinik_S2.cleaned.bam /home/jhuang/REFs/viral_ngs_dbs/spikeins/ercc_spike-ins.fasta --outStats reports/spike_count/HSV-Klinik_S2.spike_count.txt --minScoreToFilter 60
    49. bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV1_S1 HSV-Klinik_S2 --isnvs data/04_intrahost/vphaser2.HSV1_S1.txt.gz data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession --loglevel=DEBUG
    50. bin/interhost.py snpEff data/04_intrahost/isnvs.vcf.gz OP297860.1 data/04_intrahost/isnvs.annot.vcf.gz j.huang@uke.de
    51. bin/intrahost.py iSNV_table data/04_intrahost/isnvs.annot.vcf.gz data/04_intrahost/isnvs.annot.txt.gz
    52. mv data/04_intrahost data/04_intrahost_l20_g40_x20_t100_removeDoublyMappedReads_minReadsEach5_maxBias10
    53. cd data/04_intrahost_l20_g40_x20_t100_removeDoublyMappedReads_minReadsEach5_maxBias10
    54. gunzip isnvs.annot.txt.gz
    55. ~/Scripts/filter_isnv.py isnvs.annot.txt 0.05
    56. cut -d$'\t' filtered_isnvs.annot.txt -f1-7
    57. chr pos sample patient time alleles iSNV_freq
    58. OP297860 13203 HSV1_S1 HSV1_S1 T,C,A 1.0
    59. OP297860 13203 HSV-Klinik_S2 HSV-Klinik_S2 T,C,A 1.0
    60. OP297860 13522 HSV1_S1 HSV1_S1 G,T 1.0
    61. OP297860 13522 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.008905554253573941
    62. OP297860 13659 HSV1_S1 HSV1_S1 G,T 1.0
    63. OP297860 13659 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.008383233532934131
    64. OP297860 47109 HSV1_S1 HSV1_S1 T,G 0.0
    65. OP297860 47109 HSV-Klinik_S2 HSV-Klinik_S2 T,G 0.9929754139488208
    66. OP297860 47969 HSV1_S1 HSV1_S1 C,T,A 1.0
    67. OP297860 47969 HSV-Klinik_S2 HSV-Klinik_S2 C,T,A 0.017707985299031073
    68. OP297860 47989 HSV1_S1 HSV1_S1 G,T 0.0
    69. OP297860 47989 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.053730321697467484
    70. OP297860 53691 HSV1_S1 HSV1_S1 G,T 1.0
    71. OP297860 53691 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.023529411764705882
    72. OP297860 55501 HSV1_S1 HSV1_S1 T,C,A 1.0
    73. OP297860 55501 HSV-Klinik_S2 HSV-Klinik_S2 T,C,A 1.0
    74. OP297860 55807 HSV1_S1 HSV1_S1 C,A 0.0
    75. OP297860 55807 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.062176165803108814
    76. OP297860 63248 HSV1_S1 HSV1_S1 G,T 1.0
    77. OP297860 63248 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.016983016983016984
    78. OP297860 64328 HSV1_S1 HSV1_S1 C,A 1.0
    79. OP297860 64328 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.008469449485783423
    80. OP297860 65225 HSV1_S1 HSV1_S1 G,A 0.0
    81. OP297860 65225 HSV-Klinik_S2 HSV-Klinik_S2 G,A 0.8915304606240714
    82. OP297860 65402 HSV1_S1 HSV1_S1 C,A 0.0
    83. OP297860 65402 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.10222222222222224
    84. OP297860 66570 HSV1_S1 HSV1_S1 G,T 0.0
    85. OP297860 66570 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.05144291091593475
    86. OP297860 94750 HSV1_S1 HSV1_S1 G,T 0.0
    87. OP297860 94750 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.052851182197496516
    88. OP297860 95302 HSV1_S1 HSV1_S1 C,A 1.0
    89. OP297860 95302 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.01276595744680851
    90. #mv data/04_intrahost data/04_intrahost_l40_g30_x15_t80_removeDoublyMappedReads_minReadsEach5_maxBias10
    91. #cd data/04_intrahost_l40_g30_x15_t80_removeDoublyMappedReads_minReadsEach5_maxBias10
    92. #gunzip isnvs.annot.txt.gz
    93. #Keep groups where at least one record has iSNV_freq >= 0.05
    94. #~/Scripts/filter_isnv.py isnvs.annot.txt 0.05
    95. #cut -d$'\t' filtered_isnvs.annot.txt -f1-7
    96. # -- 04_intrahost_--minReadsEach5_--maxBias10 --
    97. mkdir data/04_intrahost
    98. bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV1_S1.mapped.bam data/02_assembly/HSV1_S1.fasta data/04_intrahost/vphaser2.HSV1_S1.txt.gz --vphaserNumThreads 120 --minReadsEach 5 --maxBias 10
    99. bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV-Klinik_S2.mapped.bam data/02_assembly/HSV-Klinik_S2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --vphaserNumThreads 120 --minReadsEach 5 --maxBias 10
    100. #bin/read_utils.py bwamem_idxstats data/01_cleaned/HSV-Klinik_S2.cleaned.bam /home/jhuang/REFs/viral_ngs_dbs/spikeins/ercc_spike-ins.fasta --outStats reports/spike_count/HSV-Klinik_S2.spike_count.txt --minScoreToFilter 60
    101. bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV1_S1 HSV-Klinik_S2 --isnvs data/04_intrahost/vphaser2.HSV1_S1.txt.gz data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession --loglevel=DEBUG
    102. bin/interhost.py snpEff data/04_intrahost/isnvs.vcf.gz OP297860.1 data/04_intrahost/isnvs.annot.vcf.gz j.huang@uke.de
    103. bin/intrahost.py iSNV_table data/04_intrahost/isnvs.annot.vcf.gz data/04_intrahost/isnvs.annot.txt.gz
    104. mv data/04_intrahost data/04_intrahost_l20_g40_x20_t100_minReadsEach5_maxBias10
    105. cd data/04_intrahost_l20_g40_x20_t100_minReadsEach5_maxBias10
    106. gunzip isnvs.annot.txt.gz
    107. ~/Scripts/filter_isnv.py isnvs.annot.txt 0.05
    108. cut -d$'\t' filtered_isnvs.annot.txt -f1-7
    109. chr pos sample patient time alleles iSNV_freq
    110. OP297860 13203 HSV1_S1 HSV1_S1 T,C,A 1.0
    111. OP297860 13203 HSV-Klinik_S2 HSV-Klinik_S2 T,C,A 1.0
    112. OP297860 13522 HSV1_S1 HSV1_S1 G,T 1.0
    113. OP297860 13522 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.008888888888888889
    114. OP297860 13659 HSV1_S1 HSV1_S1 G,T 1.0
    115. OP297860 13659 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.008359207069500836
    116. OP297860 47109 HSV1_S1 HSV1_S1 T,G 0.0
    117. OP297860 47109 HSV-Klinik_S2 HSV-Klinik_S2 T,G 0.9930174563591022
    118. OP297860 47969 HSV1_S1 HSV1_S1 C,T,A 1.0
    119. OP297860 47969 HSV-Klinik_S2 HSV-Klinik_S2 C,T,A 0.01828457446808511
    120. OP297860 47989 HSV1_S1 HSV1_S1 G,T 0.0
    121. OP297860 47989 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.053474114441416885
    122. OP297860 53691 HSV1_S1 HSV1_S1 G,T 1.0
    123. OP297860 53691 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.02342786683107275
    124. OP297860 55501 HSV1_S1 HSV1_S1 T,C,A 1.0
    125. OP297860 55501 HSV-Klinik_S2 HSV-Klinik_S2 T,C,A 1.0
    126. OP297860 55807 HSV1_S1 HSV1_S1 C,A 0.0
    127. OP297860 55807 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.061538461538461535
    128. OP297860 63248 HSV1_S1 HSV1_S1 G,T 1.0
    129. OP297860 63248 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.016815034619188922
    130. OP297860 64328 HSV1_S1 HSV1_S1 C,A 1.0
    131. OP297860 64328 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.008433734939759036
    132. OP297860 65225 HSV1_S1 HSV1_S1 G,A 0.0
    133. OP297860 65225 HSV-Klinik_S2 HSV-Klinik_S2 G,A 0.8916728076639646
    134. OP297860 65402 HSV1_S1 HSV1_S1 C,A 0.0
    135. OP297860 65402 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.1018149623727313
    136. OP297860 66570 HSV1_S1 HSV1_S1 G,T 0.0
    137. OP297860 66570 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.05112219451371571
    138. OP297860 94750 HSV1_S1 HSV1_S1 G,T 0.0
    139. OP297860 94750 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.052851182197496516
    140. OP297860 95302 HSV1_S1 HSV1_S1 C,A 1.0
    141. OP297860 95302 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.012725344644750796
    142. # -- 04_intrahost_--minReadsEach5_--maxBias1000000 --
    143. mkdir data/04_intrahost
    144. bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV1_S1.mapped.bam data/02_assembly/HSV1_S1.fasta data/04_intrahost/vphaser2.HSV1_S1.txt.gz --vphaserNumThreads 120 --minReadsEach 5 --maxBias 1000000
    145. bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV-Klinik_S2.mapped.bam data/02_assembly/HSV-Klinik_S2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --vphaserNumThreads 120 --minReadsEach 5 --maxBias 1000000
    146. #bin/read_utils.py bwamem_idxstats data/01_cleaned/HSV-Klinik_S2.cleaned.bam /home/jhuang/REFs/viral_ngs_dbs/spikeins/ercc_spike-ins.fasta --outStats reports/spike_count/HSV-Klinik_S2.spike_count.txt --minScoreToFilter 60
    147. bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV1_S1 HSV-Klinik_S2 --isnvs data/04_intrahost/vphaser2.HSV1_S1.txt.gz data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession --loglevel=DEBUG
    148. bin/interhost.py snpEff data/04_intrahost/isnvs.vcf.gz OP297860.1 data/04_intrahost/isnvs.annot.vcf.gz j.huang@uke.de
    149. bin/intrahost.py iSNV_table data/04_intrahost/isnvs.annot.vcf.gz data/04_intrahost/isnvs.annot.txt.gz
    150. mv data/04_intrahost data/04_intrahost_l20_g40_x20_t100_minReadsEach5_maxBias1000000
    151. cd data/04_intrahost_l20_g40_x20_t100_minReadsEach5_maxBias1000000
    152. gunzip isnvs.annot.txt.gz
    153. ~/Scripts/filter_isnv.py isnvs.annot.txt 0.05
    154. cut -d$'\t' filtered_isnvs.annot.txt -f1-7
    155. chr pos sample patient time alleles iSNV_freq
    156. OP297860 13203 HSV1_S1 HSV1_S1 T,C,A 1.0
    157. OP297860 13203 HSV-Klinik_S2 HSV-Klinik_S2 T,C,A 1.0
    158. OP297860 13522 HSV1_S1 HSV1_S1 G,T 1.0
    159. OP297860 13522 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.008888888888888889
    160. OP297860 13659 HSV1_S1 HSV1_S1 G,T 1.0
    161. OP297860 13659 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.008359207069500836
    162. OP297860 47109 HSV1_S1 HSV1_S1 T,G 0.0
    163. OP297860 47109 HSV-Klinik_S2 HSV-Klinik_S2 T,G 0.9930174563591022
    164. OP297860 47778 HSV1_S1 HSV1_S1 G,T 0.0
    165. OP297860 47778 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.05263157894736842
    166. OP297860 47969 HSV1_S1 HSV1_S1 C,T,A 1.0
    167. OP297860 47969 HSV-Klinik_S2 HSV-Klinik_S2 C,T,A 0.01828457446808511
    168. OP297860 47989 HSV1_S1 HSV1_S1 G,T 0.0
    169. OP297860 47989 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.053474114441416885
    170. OP297860 53691 HSV1_S1 HSV1_S1 G,T 1.0
    171. OP297860 53691 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.02342786683107275
    172. OP297860 55501 HSV1_S1 HSV1_S1 T,C,A 1.0
    173. OP297860 55501 HSV-Klinik_S2 HSV-Klinik_S2 T,C,A 1.0
    174. OP297860 55807 HSV1_S1 HSV1_S1 C,A 0.0
    175. OP297860 55807 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.061538461538461535
    176. OP297860 63248 HSV1_S1 HSV1_S1 G,T 1.0
    177. OP297860 63248 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.016815034619188922
    178. OP297860 64328 HSV1_S1 HSV1_S1 C,A 1.0
    179. OP297860 64328 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.008433734939759036
    180. OP297860 65225 HSV1_S1 HSV1_S1 G,A 0.0
    181. OP297860 65225 HSV-Klinik_S2 HSV-Klinik_S2 G,A 0.8916728076639646
    182. OP297860 65402 HSV1_S1 HSV1_S1 C,A 0.0
    183. OP297860 65402 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.1018149623727313
    184. OP297860 66570 HSV1_S1 HSV1_S1 G,T 0.0
    185. OP297860 66570 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.05112219451371571
    186. OP297860 94750 HSV1_S1 HSV1_S1 G,T 0.0
    187. OP297860 94750 HSV-Klinik_S2 HSV-Klinik_S2 G,T 0.052851182197496516
    188. OP297860 95302 HSV1_S1 HSV1_S1 C,A 1.0
    189. OP297860 95302 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.012725344644750796
  20. Install a new viral-ngs including interhost and annotation steps (Failed!)

    1. #https://viral-ngs.readthedocs.io/en/latest/install.html
    2. wget https://raw.githubusercontent.com/broadinstitute/viral-ngs/master/easy-deploy-script/easy-deploy-viral-ngs.sh && chmod a+x ./easy-deploy-viral-ngs.sh && reuse UGER && qrsh -l h_vmem=10G -cwd -N "viral-ngs_deploy" -q interactive ./easy-deploy-viral-ngs.sh setup
    3. source ./easy-deploy-viral-ngs.sh load
    4. ./easy-deploy-viral-ngs.sh create-project HSV1_Capture
    5. #docker installation
    6. sudo usermod -aG docker jhuang
    7. #newgrp docker
    8. groups jhuang
    9. docker pull quay.io/broadinstitute/viral-ngs
    10. docker run -it quay.io/broadinstitute/viral-ngs /bin/bash
  21. Note that the intrahost results does not include the interhost results. Checking process.

    1. #Under data/02_assembly
    2. cp ../../ref_genome/reference.fasta HSV1_S1.fasta #>HSV1_S1-1
    3. cp ../../ref_genome/reference.fasta HSV-Klinik_S2.fasta #>HSV-Klinik_S2-1
    4. samtools faidx HSV1_S1.fasta
    5. picard CreateSequenceDictionary R=HSV1_S1.fasta O=HSV1_S1.dict
    6. ~/Tools/novocraft_v3/novoindex HSV1_S1.nix HSV1_S1.fasta
    7. samtools faidx HSV-Klinik_S2.fasta
    8. picard CreateSequenceDictionary R=HSV-Klinik_S2.fasta O=HSV-Klinik_S2.dict
    9. ~/Tools/novocraft_v3/novoindex HSV-Klinik_S2.nix HSV-Klinik_S2.fasta
    10. #total 128140
    11. #-rw-rw-r-- 1 jhuang jhuang 76693037 Nov 13 09:59 HSV1_S1.bam
    12. #-rw-rw-r-- 1 jhuang jhuang 34590 Nov 13 09:59 HSV1_S1.mapped.bam
    13. #-rw-rw-r-- 1 jhuang jhuang 48946378 Nov 13 10:03 HSV-Klinik_S2.bam
    14. #-rw-rw-r-- 1 jhuang jhuang 5537247 Nov 13 10:03 HSV-Klinik_S2.mapped.bam
    15. # vs
    16. #total 128140
    17. #-rw-rw-r-- 1 jhuang jhuang 76693095 Nov 15 12:47 HSV1_S1.bam
    18. #-rw-rw-r-- 1 jhuang jhuang 34587 Nov 15 12:47 HSV1_S1.mapped.bam
    19. #-rw-rw-r-- 1 jhuang jhuang 48946337 Nov 15 12:48 HSV-Klinik_S2.bam
    20. #-rw-rw-r-- 1 jhuang jhuang 5537246 Nov 15 12:48 HSV-Klinik_S2.mapped.bam
    21. #Manually generate the aligned_1.fasta due to too long runtime.
    22. cat ../../ref_genome/reference.fasta ../02_assembly/HSV1_S1.fasta ../02_assembly/HSV-Klinik_S2.fasta > aligned_1.fasta
    23. #>OP297860.1 Human alphaherpesvirus 1 strain HSV1-v60_d3_cu_gen_les, complete genome
    24. #>HSV1_S1-1
    25. #>HSV-Klinik_S2-1
    26. #If this results is similar to freebayes, means the results successfully include interhost-results.
    27. #TODO: In next step, we should feed another bam-files, e.g. the cleaned bam-file into the pipelines!
    28. #DOESN'T WORK: snakemake --cleanup-metadata data/03_multialign_to_ref/sampleNameList.txt data/03_multialign_to_ref/aligned_1.fasta --cores 1
    29. snakemake --printshellcmds --cores all
    30. mkdir data/04_intrahost
    31. bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV1_S1.mapped.bam data/02_assembly/HSV1_S1.fasta data/04_intrahost/vphaser2.HSV1_S1.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10
    32. bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV-Klinik_S2.mapped.bam data/02_assembly/HSV-Klinik_S2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10
    33. bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV1_S1 HSV-Klinik_S2 --isnvs data/04_intrahost/vphaser2.HSV1_S1.txt.gz data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession --loglevel=DEBUG
    34. bin/interhost.py snpEff data/04_intrahost/isnvs.vcf.gz OP297860.1 data/04_intrahost/isnvs.annot.vcf.gz j.huang@uke.de
    35. bin/intrahost.py iSNV_table data/04_intrahost/isnvs.annot.vcf.gz data/04_intrahost/isnvs.annot.txt.gz
    36. mv data/04_intrahost data/04_intrahost_including_interhost
    37. cd data/04_intrahost_including_interhost
    38. gunzip isnvs.annot.txt.gz
    39. ~/Scripts/filter_isnv.py isnvs.annot.txt 0.05
    40. cut -d$'\t' filtered_isnvs.annot.txt -f1-7
    41. bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV1_S1 HSV-Klinik_S2 --isnvs data/04_intrahost/vphaser2.HSV1_S1.txt.gz data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession
    42. awk '$7 >= 5' vphaser2.HSV-Klinik_S2_removeDoubly_min5_max1000000_w25000.txt > vphaser2.HSV-Klinik_S2_removeDoubly_min5_max1000000_w25000_0.05.txt
    43. awk '$7 >= 50' vphaser2.HSV-Klinik_S2_removeDoubly_min5_max1000000_w25000.txt > vphaser2.HSV-Klinik_S2_removeDoubly_min5_max1000000_w25000_0.5.txt
    44. # How many SNPs?
    45. #bin/intrahost.py vphaser_one_sample data_v2/02_align_to_self/HSV-Klinik_S2.mapped.bam data_v2/02_assembly/HSV-Klinik_S2.fasta data_v2/04_intrahost/vphaser2.HSV-Klinik_S2_v2.txt.gz --vphaserNumThreads 120 --minReadsEach 5 --maxBias 1000000 --loglevel DEBUG
    46. #mv vphaser2.HSV-Klinik_S2.txt.gz
    47. # How many SNPs?
    48. awk '$7 >= 5' vphaser2.HSV-Klinik_S2_v2.txt > vphaser2.HSV-Klinik_S2_v2_.txt
    49. bin/intrahost.py vphaser_one_sample data_v2/02_align_to_self/HSV-Klinik_S2.mapped.bam data_v2/02_assembly/HSV-Klinik_S2.fasta data_v2/04_intrahost/vphaser2.HSV-Klinik_S2_v3.txt.gz --vphaserNumThreads 120 --minReadsEach 5 --maxBias 10 --loglevel DEBUG
    50. # How many SNPs?
    51. awk '$6 >= 0.05' isnvs.annot.txt > isnvs.annot_.txt
    52. -------
    53. #I used the viral-ngs get a table as follows:
    54. chr pos sample patient time alleles iSNV_freq Hw Hs eff_type eff_codon_dna eff_aa eff_aa_pos eff_prot_len eff_gene eff_protein
    55. OP297860 9012 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.0155954631379962 0.0307044893350152 1 intergenic_region n.9012C>A RL2-UL1 Gene_1996_5580-Gene_9025_10636
    56. OP297860 9017 HSV-Klinik_S2 HSV-Klinik_S2 C,A 0.0408905043162199 0.0784369419459701 1 intergenic_region n.9017C>A RL2-UL1 Gene_1996_5580-Gene_9025_10636
    57. In the process, the intrahost.py was used.
    58. intrahost.py - within-host genetic variation (iSNVs) The output has only contains
    59. I can so understand, the intrahost variants only reported. The chr OP297860 is only for the annotation. If a position in my clinical sample HSV-Klinik_S2 is different to OP297860, it will be not reported and not exists in the table.
    60. Column Descriptions in the Output Table
    61. The output table generated by this script will contain the following columns:
    62. chr: Chromosome or contig where the variant is located.
    63. pos: Position on the chromosome/contig of the variant.
    64. sample: The sample identifier for this variant.
    65. patient: Patient ID extracted from the sample name (assumes the format sample.patient).
    66. time: Time point of sample collection, extracted from the sample name (if present).
    67. alleles: The alleles involved in the variant. For example, C,A means Cytosine (C) and Adenine (A).
    68. iSNV_freq: Frequency of the variant in the sample. This is the sum of the frequencies of the variant alleles.
    69. Hw: Hardy-Weinberg equilibrium p-value for the variant. This is calculated from the genotype frequencies in the sample and indicates how well they conform to random mating expectations.
    70. Hs: Heterozygosity in the population based on consensus genotypes. It measures genetic diversity based on observed genotypes.
    71. eff_type: The type of effect the variant has on the gene, such as intergenic_region, start_lost, etc.
    72. eff_codon_dna: The effect of the variant at the DNA level (e.g., n.9012C>A).
    73. eff_aa: The amino acid effect of the variant (e.g., a change from one amino acid to another or a frameshift).
    74. eff_aa_pos: The position of the amino acid affected by the variant.
    75. eff_prot_len: The length of the protein after the variant is applied, which may be truncated if the variant causes a frameshift or a stop codon.
    76. eff_gene: The gene affected by the variant.
    77. eff_protein: The protein affected by the variant (e.g., a protein identifier like UXY89132.1).
    78. b'/home/jhuang/miniconda3/envs/viral-ngs4/bin/python\n'
    79. -------
    80. 2024-11-12 13:22:47,892 - cmd:193:main_argparse - INFO - software version: 1522433800, python version: 3.6.7 | packaged by conda-forge | (default, Feb 28 2019, 09:07:38)
    81. [GCC 7.3.0]
    82. 2024-11-12 13:22:47,893 - cmd:195:main_argparse - INFO - command: bin/intrahost.py merge_to_vcf refFasta=ref_genome/reference.fasta outVcf=data/04_intrahost/isnvs.vcf.gz samples=['HSV-Klinik_S2'] isnvs=['data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz'] alignments=['data/03_multialign_to_ref/aligned_1.fasta'] strip_chr_version=True naive_filter=False parse_accession=True loglevel=INFO
    83. 2024-11-12 13:22:47,893 - intrahost:476:merge_to_vcf - INFO - loaded CoordMapper for all genomes, starting VCF merge...
    84. Traceback (most recent call last):
    85. File "bin/intrahost.py", line 1152, in <module>
    86. util.cmd.main_argparse(__commands__, __doc__)
    87. File "/home/jhuang/Tools/viral-ngs/bin/util/cmd.py", line 221, in main_argparse
    88. ret = args.func_main(args)
    89. File "/home/jhuang/Tools/viral-ngs/bin/util/cmd.py", line 102, in _main
    90. mainfunc(**args2)
    91. File "bin/intrahost.py", line 530, in merge_to_vcf
    92. raise LookupError("Not all reference sequences found in alignments.")
    93. LookupError: Not all reference sequences found in alignments.
    94. [Tue Nov 12 13:22:47 2024]
    95. Error in rule isnvs_vcf:
    96. jobid: 0
    97. output: data/04_intrahost/isnvs.vcf.gz, data/04_intrahost/isnvs.vcf.gz.tbi, data/04_intrahost/isnvs.annot.vcf.gz, data/04_intrahost/isnvs.annot.txt.gz, data/04_intrahost/isnvs.annot.vcf.gz.tbi
    98. RuleException:
    99. CalledProcessError in line 61 of /mnt/md1/DATA_md1/Data_Nicole_CaptureProbeSequencing/bin/pipes/rules/intrahost.rules:
    100. Command 'set -euo pipefail; bin/intrahost.py merge_to_vcf ref_genome/reference.fasta data/04_intrahost/isnvs.vcf.gz --samples HSV-Klinik_S2 --isnvs data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --alignments data/03_multialign_to_ref/aligned_1.fasta --strip_chr_version --parse_accession' returned non-zero exit status 1.
    101. File "/mnt/md1/DATA_md1/Data_Nicole_CaptureProbeSequencing/bin/pipes/rules/intrahost.rules", line 61, in __rule_isnvs_vcf
    102. File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    103. Exiting because a job execution failed. Look above for error message
    104. Columns Breakdown:
    105. Ref_Pos (e.g., 55, 104, 210):
    106. This refers to the position in the genome where the variant occurs. In this example, the variants occur at positions 55, 104, and 210.
    107. Var (e.g., T C, G A, T C):
    108. This is the variant observed at that position. It shows the reference base (before the variant) and the alternate base (after the variant). For example:
    109. At position 55, the reference base is T and the alternate base is C.
    110. At position 104, the reference base is G and the alternate base is A.
    111. At position 210, the reference base is T and the alternate base is C.
    112. Cons (e.g., 0.8156, 0.1674, 0.1065):
    113. This represents the variant frequency (or proportion) in the sample, expressed as a decimal. It shows the fraction of reads supporting the alternate base (C, A, etc.). For example:
    114. At position 55, 81.56% of the reads support the alternate base C.
    115. At position 104, 16.74% of the reads support the alternate base A.
    116. At position 210, 10.65% of the reads support the alternate base C.
    117. Strd_bias_pval (e.g., 0.8156, 0.1674, 0.1065):
    118. This represents the strand bias p-value for the variant. It tests if there is an uneven distribution of reads between the forward and reverse strands for the variant. A higher p-value suggests no significant strand bias. A lower p-value suggests a possible strand bias, meaning the variant might be incorrectly called due to a bias in sequencing reads from one strand.
    119. Type (e.g., snp):
    120. This indicates the type of variant. In this case, it's a SNP (single nucleotide polymorphism). It means that a single nucleotide in the genome has been altered.
    121. Var_perc (e.g., 16.1, 14.07, 10.58):
    122. This represents the percentage of variants for each alternate base, which is very similar to the Cons column but expressed as a percentage. For example:
    123. At position 55, the alternate base C is observed in 16.1% of the reads.
    124. At position 104, the alternate base A is observed in 14.07% of the reads.
    125. At position 210, the alternate base C is observed in 10.58% of the reads.
    126. SNP_or_LP_Profile (e.g., C:65:34 T:13:6):
    127. This contains information on the read counts for the reference base (T, G, etc.) and the alternate base (C, A, etc.). The format is:
    128. Reference base count (forward strand : reverse strand)
    129. Alternate base count (forward strand : reverse strand)
    130. For example, at position 55:
    131. C (alternate base) has 65 reads on the forward strand and 34 on the reverse strand.
    132. T (reference base) has 13 reads on the forward strand and 6 on the reverse strand.
    133. Summary: SNPV and LPV
    134. The last line of the output gives a summary of the total number of SNPs and LPs (likely Low-Quality Polymorphisms or Low Probability Variants):
    135. # Summary: SNPV: 132; LPV: 0
    136. SNPV: 132:
    137. This indicates the total number of SNP variants detected in the data. In this case, there are 132 SNPs identified.
    138. LPV: 0:
    139. This indicates the number of Low Probability Variants (LPVs). A value of 0 means no low-quality variant calls were detected, indicating that the analysis did not identify any variants with low confidence.
    140. # Minimum number of reads on each strand
    141. vphaser_min_reads_each: 5
    142. # Maximum allowable ratio of number of reads on the two
    143. # strands. Ignored if vphaser_max_bins=0.
    144. vphaser_max_bins: 10
    145. # A simple filter for the VCF merge step.
    146. # If set to true, keep only the alleles that have at least two
    147. # independent libraries of support and
    148. # allele freq > 0.005. If false, no filtering is performed.
    149. vcf_merge_naive_filter: false
  22. (Optional)

    1. 152526
    2. GapFiller.pl -l libraries_p2564.txt -s data/02_assembly/p2564.fasta
    3. #parainfluenza bwa /home/jhuang/DATA/Data_parainfluenza/trimmed/p2564_R1.fastq.gz /home/jhuang/DATA/Data_parainfluenza/trimmed/p2564_R2.fastq.gz 300 1.0 FR
    4. #since HSV1 and HSV-Klinik_S2 has different regions covered --> multialign_to_ref is none!
    5. bin/intrahost.py vphaser_one_sample data/02_align_to_self/HSV-Klinik_S2.mapped.bam data/02_assembly/HSV-Klinik_S2.fasta data/04_intrahost/vphaser2.HSV-Klinik_S2.txt.gz --vphaserNumThreads 120 --removeDoublyMappedReads --minReadsEach 5 --maxBias 10
    6. (viral-ngs4) jhuang@WS-2290C:~/DATA/Data_Nicole_CaptureProbeSequencing/data/02_align_to_self$ samtools flagstat HSV-Klinik_S2.mapped.bam
    7. 162156 + 0 in total (QC-passed reads + QC-failed reads)
    8. 0 + 0 secondary
    9. 0 + 0 supplementary
    10. 0 + 0 duplicates
    11. 162156 + 0 mapped (100.00% : N/A)
    12. 162156 + 0 paired in sequencing
    13. 81048 + 0 read1
    14. 81108 + 0 read2
    15. 161068 + 0 properly paired (99.33% : N/A)
    16. 161630 + 0 with itself and mate mapped
    17. 526 + 0 singletons (0.32% : N/A)
    18. 0 + 0 with mate mapped to a different chr
    19. 0 + 0 with mate mapped to a different chr (mapQ>=5)
    20. (viral-ngs4) jhuang@WS-2290C:~/DATA/Data_Nicole_CaptureProbeSequencing/data/01_per_sample$ samtools flagstat HSV-Klinik_S2.taxfilt.bam
    21. 800454 + 0 in total (QC-passed reads + QC-failed reads)
    22. 0 + 0 secondary
    23. 0 + 0 supplementary
    24. 0 + 0 duplicates
    25. 0 + 0 mapped (0.00% : N/A)
    26. 800454 + 0 paired in sequencing
    27. 400227 + 0 read1
    28. 400227 + 0 read2
    29. 0 + 0 properly paired (0.00% : N/A)
    30. 0 + 0 with itself and mate mapped
    31. 0 + 0 singletons (0.00% : N/A)
    32. 0 + 0 with mate mapped to a different chr
    33. 0 + 0 with mate mapped to a different chr (mapQ>=5)
    34. (viral-ngs4) jhuang@WS-2290C:~/DATA/Data_Nicole_CaptureProbeSequencing/data/02_align_to_self$ samtools flagstat HSV-Klinik_S2.bam
    35. 885528 + 0 in total (QC-passed reads + QC-failed reads)
    36. 0 + 0 secondary
    37. 0 + 0 supplementary
    38. 191932 + 0 duplicates
    39. 354088 + 0 mapped (39.99% : N/A)
    40. 885528 + 0 paired in sequencing
    41. 442764 + 0 read1
    42. 442764 + 0 read2
    43. 323502 + 0 properly paired (36.53% : N/A)
    44. 324284 + 0 with itself and mate mapped
    45. 29804 + 0 singletons (3.37% : N/A)
    46. 0 + 0 with mate mapped to a different chr
    47. 0 + 0 with mate mapped to a different chr (mapQ>=5)
  23. Summarize statistics from snakemake-output

    1. samples-runs.txt
    2. samtools flagstat data/02_align_to_self/838_S1.mapped.bam
    3. samtools flagstat data/02_align_to_self/840_S2.mapped.bam
    4. samtools flagstat data/02_align_to_self/820_S3.mapped.bam
    5. samtools flagstat data/02_align_to_self/828_S4.mapped.bam
    6. samtools flagstat data/02_align_to_self/815_S5.mapped.bam
    7. samtools flagstat data/02_align_to_self/834_S6.mapped.bam
    8. samtools flagstat data/02_align_to_self/808_S7.mapped.bam
    9. samtools flagstat data/02_align_to_self/811_S8.mapped.bam
    10. samtools flagstat data/02_align_to_self/837_S9.mapped.bam
    11. samtools flagstat data/02_align_to_self/768_S10.mapped.bam
    12. samtools flagstat data/02_align_to_self/773_S11.mapped.bam
    13. samtools flagstat data/02_align_to_self/767_S12.mapped.bam
    14. samtools flagstat data/02_align_to_self/810_S13.mapped.bam
    15. samtools flagstat data/02_align_to_self/814_S14.mapped.bam
    16. samtools flagstat data/02_align_to_self/10121-16_S15.mapped.bam --> 3c
    17. Origin of hepatitis C virus genotype 3 in Africa as estimated
    18. through an evolutionary analysis of the full-length genomes of nine
    19. subtypes, including the newly sequenced 3d and 3e
    20. samtools flagstat data/02_align_to_self/7510-15_S16.mapped.bam -->
    21. samtools flagstat data/02_align_to_self/828-17_S17.mapped.bam
    22. samtools flagstat data/02_align_to_self/8806-15_S18.mapped.bam
    23. samtools flagstat data/02_align_to_self/9881-16_S19.mapped.bam
    24. samtools flagstat data/02_align_to_self/8981-14_S20.mapped.bam
  24. Consensus sequences of each and of all isolates

    1. cp data/02_assembly/*.fasta ./
    2. for sample in 838_S1 840_S2 820_S3 828_S4 815_S5 834_S6 808_S7 811_S8 837_S9 768_S10 773_S11 767_S12 810_S13 814_S14 10121-16_S15 7510-15_S16 828-17_S17 8806-15_S18 9881-16_S19 8981-14_S20; do
    3. for sample in p953-84660-tsek p938-16972-nra p942-88507-nra p943-98523-nra p944-103323-nra p947-105565-nra p948-112830-nra; do \
    4. mv ${sample}.fasta ${sample}.fa
    5. cat all.fa ${sample}.fa >> all.fa
    6. done
    7. cat RSV_dedup.fa all.fa > RSV_all.fa
    8. mafft --adjustdirection RSV_all.fa > RSV_all.aln
    9. snp-sites RSV_all.aln -o RSV_all_.aln
  25. Finding the next strain with Phylogenetics: send both HCV231_all.png and HCV231_all.pdf to the Nicole

    1. #1, generate tree
    2. cat SARS-CoV-2_len25000_w60_newheader.fa ~/rtpd_files/2029-AW_S5/idba_ud_assembly/gapped_contig.fa > CoV2_all.fa
    3. mafft --adjustdirection CoV2_all.fa > CoV2_all.aln
    4. snp-sites CoV2_all.aln -o CoV2_all_.aln
    5. fasttree -gtr -nt RSV_all_.aln > RSV_all.tree
    6. fasttree -gtr -nt Ortho_SNP_matrix_RAxML.fa > Ortho_SNP_matrix_RAxML.tree
    7. raxml-ng --all --model GTR+G+ASC_LEWIS --prefix CoV2_all_raxml.aln --threads 1 --msa CoV2_all_.aln --bs-trees 1000 --redo
    8. #raxml-ng --all --model GTR+G+ASC_LEWIS --prefix raxml-ng/snippy.core.aln --threads 1 --msa variants/snippy.core.aln --bs-trees 1000 --redo
    9. #2, open tree on Dendroscope, from phylogenetic tree, get genotype-refs as follows,
    10. 1a: S10, S11, 814_S14(3-->1a?), S18 --> 1a_EF407457
    11. 1b: S12 --> 1b_M58335
    12. 2a: 815_S5(3-->2a?) --> 2a_D00944
    13. 2c: S20 --> 2c_D50409
    14. 3a: S3, S7, S8, S13, S15, S16, S19 --> 3c_KY620605
    15. 4d: S1, S2, S9 --> 4d_EU392172
    16. 4k: S4, S6 --> 4k_EU392173
    17. --> KX249682.1
    18. --> KX765935.1
    19. --> KM517573.1
    20. cd data/02_assembly/
    21. cat p2.fasta p3e.fasta p4e.fasta p5e.fasta > all.fasta
    22. sed -i -e 's/-1//g' all.fasta
    23. #sed -i -e 's/e-1//g' all.fasta
    24. mafft --adjustdirection --clustalout all.fasta > all.aln
    25. # MANUALLY CORRECTION!
    26. ##POLISH the assembled contigs
    27. #for sample in p953 p938 p942 p943 p944 p947 p948 p955 p954 p952 p951 p946 p945 p940; do
    28. # rm ${sample}_polished.fa
    29. # #seqtk sample ../../trimmed/${sample}_R1.fastq.gz 0.1 > ${sample}_0.1_R1.fastq
    30. # #seqtk sample ../../trimmed/${sample}_R2.fastq.gz 0.1 > ${sample}_0.1_R2.fastq
    31. # polish_viral_ref.sh -1 ../../trimmed/${sample}_R1.fastq.gz -2 ../../trimmed/${sample}_R2.fastq.gz -r ${sample}.fasta -o ${sample}_polished.fa -t 6
    32. #done
    33. for sample in p946 p954 p952 p948 p945 p947 p955 p943 p951 p942; do #all.aln
    34. for sample in p944 p938 p953 p940; do #all2.aln
    35. for sample in p2 p3 p4 p5; do
    36. grep "${sample}" all.aln > REF${sample}.fasta
    37. #cut -f2-2 -d$'\t' REF${sample}.fasta > REF${sample}.fast
    38. sed -i -e "s/${sample} //g" REF${sample}.fasta
    39. sed -i -e "s/${sample}-1 //g" REF${sample}.fasta
    40. sed -i -e 's/-//g' REF${sample}.fasta
    41. echo ">REF${sample}" > REF${sample}.header
    42. cat REF${sample}.header REF${sample}.fasta > REF${sample}.fas
    43. seqkit seq -u REF${sample}.fas -o REF${sample}.fa
    44. cp REF${sample}.fa ${sample}.fa
    45. mv REF${sample}.fa ../..
    46. sed -i -e "s/REF//g" ${sample}.fa #still under data/02_assembly/
    47. done
    48. #ReferenceSeeker determines closely related reference genomes
    49. #https://github.com/oschwengers/referenceseeker
    50. (referenceseeker) jhuang@hamburg:~/DATA/Data_Holger_Efaecium$ ~/Tools/referenceseeker/bin/referenceseeker -v ~/REFs/bacteria-refseq/ shovill/noAB_wildtype/contigs.fasta
    51. # Annotating the fasta using VAPiD
    52. makeblastdb -in *.fasta -dbtype nucl
    53. python ~/Tools/VAPiD/vapid3.py --db ~/REFs/all_virus/all_virus.fasta p946R.fa ~/REFs/template_Less.sbt
    54. python ~/Tools/VAPiD/vapid3.py --db ~/REFs/all_virus/all_virus.fasta REFp944.fa ~/REFs/template_Less.sbt # KT581445.1 selected!
    55. python ~/Tools/VAPiD/vapid3.py --db ~/REFs/all_virus/all_virus.fasta contigs_final.fasta ~/REFs/template_Amir.sbt
    56. python ~/Tools/VAPiD/vapid3.py --online contigs_final.fasta ~/REFs/template_Amir.sbt
  26. All packages under the viral-ngs4 env, note that novoalign is not installed. The used Novoalign path: /home/jhuang/Tools/novocraft_v3/novoalign; the used gatk: /usr/local/bin/gatk using /home/jhuang/Tools/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar (see the point 9).

    1. mamba remove viral-ngs --all
    2. mamba remove viral-ngs-env --all
    3. conda remove viral-ngs-java7 --all
    4. conda remove viral-ngs-java8 --all
    5. conda remove viral-ngs-py36 --all
    6. conda remove viral-ngs2 --all
    7. conda remove viral-ngs3 --all
    8. jhuang@WS-2290C:~$ conda activate viral-ngs4
    9. (viral-ngs4) jhuang@WS-2290C:~$ conda list
    10. # packages in environment at /home/jhuang/miniconda3/envs/viral-ngs4:
    11. #
    12. # Name Version Build Channel
    13. _libgcc_mutex 0.1 conda_forge conda-forge
    14. _openmp_mutex 4.5 2_gnu conda-forge
    15. _r-mutex 1.0.1 anacondar_1 conda-forge
    16. alsa-lib 1.2.3.2 h166bdaf_0 conda-forge
    17. bamtools 2.5.2 hdcf5f25_5 bioconda
    18. bedtools 2.31.1 hf5e1c6e_2 bioconda
    19. binutils_impl_linux-64 2.43 h4bf12b8_2 conda-forge
    20. binutils_linux-64 2.43 h4852527_2 conda-forge
    21. biopython 1.79 py36h8f6f2f9_0 conda-forge
    22. blast 2.6.0 boost1.64_2 bioconda
    23. bmfilter 3.101 h4ac6f70_5 bioconda
    24. bmtagger 3.101 h470a237_4 bioconda
    25. bmtool 3.101 hdbdd923_5 bioconda
    26. boost 1.64.0 py36_4 conda-forge
    27. boost-cpp 1.64.0 1 conda-forge
    28. bowtie 1.3.1 py36h769816f_3 bioconda
    29. bowtie2 2.5.4 h7071971_4 bioconda
    30. bwa 0.7.18 he4a0461_1 bioconda
    31. bwidget 1.9.14 ha770c72_1 conda-forge
    32. bzip2 1.0.8 h4bc722e_7 conda-forge
    33. c-ares 1.34.2 heb4867d_0 conda-forge
    34. ca-certificates 2024.9.24 h06a4308_0
    35. cairo 1.16.0 h18b612c_1001 conda-forge
    36. cd-hit 4.8.1 h43eeafb_10 bioconda
    37. cd-hit-auxtools 4.8.1 h4ac6f70_3 bioconda
    38. certifi 2021.5.30 py36h5fab9bb_0 conda-forge
    39. curl 7.68.0 hf8cf82a_0 conda-forge
    40. cycler 0.11.0 pyhd8ed1ab_0 conda-forge
    41. dbus 1.13.6 hfdff14a_1 conda-forge
    42. diamond 2.1.10 h43eeafb_2 bioconda
    43. expat 2.6.4 h5888daf_0 conda-forge
    44. extract_fullseq 3.101 h4ac6f70_5 bioconda
    45. fastqc 0.12.1 hdfd78af_0 bioconda
    46. font-ttf-dejavu-sans-mono 2.37 hab24e00_0 conda-forge
    47. fontconfig 2.14.1 hef1e5e3_0
    48. freetype 2.12.1 h267a509_2 conda-forge
    49. fribidi 1.0.10 h36c2ea0_0 conda-forge
    50. future 0.18.2 py36h5fab9bb_3 conda-forge
    51. gap2seq 2.1 boost1.64_1 bioconda
    52. gatk 3.6 hdfd78af_11 bioconda
    53. gcc_impl_linux-64 14.2.0 h6b349bd_1 conda-forge
    54. gcc_linux-64 14.2.0 h5910c8f_5 conda-forge
    55. gettext 0.22.5 he02047a_3 conda-forge
    56. gettext-tools 0.22.5 he02047a_3 conda-forge
    57. gfortran_impl_linux-64 14.2.0 hc73f493_1 conda-forge
    58. gfortran_linux-64 14.2.0 hda50785_5 conda-forge
    59. giflib 5.2.2 hd590300_0 conda-forge
    60. glib 2.66.3 h58526e2_0 conda-forge
    61. graphite2 1.3.13 h59595ed_1003 conda-forge
    62. gsl 2.4 h294904e_1006 conda-forge
    63. gst-plugins-base 1.14.5 h0935bb2_2 conda-forge
    64. gstreamer 1.14.5 h36ae1b5_2 conda-forge
    65. gxx_impl_linux-64 14.2.0 h2c03514_1 conda-forge
    66. gxx_linux-64 14.2.0 h9423afd_5 conda-forge
    67. harfbuzz 2.4.0 h37c48d4_1 conda-forge
    68. icu 58.2 hf484d3e_1000 conda-forge
    69. jpeg 9e h0b41bf4_3 conda-forge
    70. kernel-headers_linux-64 3.10.0 he073ed8_18 conda-forge
    71. keyutils 1.6.1 h166bdaf_0 conda-forge
    72. kiwisolver 1.3.1 py36h605e78d_1 conda-forge
    73. kmer-jellyfish 2.3.1 h4ac6f70_2 bioconda
    74. krb5 1.16.4 h2fd8d38_0 conda-forge
    75. last 876 py36_0 bioconda
    76. lcms2 2.12 hddcbb42_0 conda-forge
    77. ld_impl_linux-64 2.43 h712a8e2_2 conda-forge
    78. libasprintf 0.22.5 he8f35ee_3 conda-forge
    79. libasprintf-devel 0.22.5 he8f35ee_3 conda-forge
    80. libblas 3.9.0 25_linux64_openblas conda-forge
    81. libcblas 3.9.0 25_linux64_openblas conda-forge
    82. libcurl 7.68.0 hda55be3_0 conda-forge
    83. libdeflate 1.21 h4bc722e_0 conda-forge
    84. libedit 3.1.20191231 he28a2e2_2 conda-forge
    85. libev 4.33 hd590300_2 conda-forge
    86. libexpat 2.6.4 h5888daf_0 conda-forge
    87. libffi 3.2.1 he1b5a44_1007 conda-forge
    88. libgcc 14.2.0 h77fa898_1 conda-forge
    89. libgcc-devel_linux-64 14.2.0 h41c2201_101 conda-forge
    90. libgcc-ng 14.2.0 h69a702a_1 conda-forge
    91. libgettextpo 0.22.5 he02047a_3 conda-forge
    92. libgettextpo-devel 0.22.5 he02047a_3 conda-forge
    93. libgfortran-ng 7.5.0 h14aa051_20 conda-forge
    94. libgfortran4 7.5.0 h14aa051_20 conda-forge
    95. libgfortran5 14.2.0 hd5240d6_1 conda-forge
    96. libglib 2.66.3 hbe7bbb4_0 conda-forge
    97. libgomp 14.2.0 h77fa898_1 conda-forge
    98. libiconv 1.17 hd590300_2 conda-forge
    99. libidn11 1.33 h7b6447c_0
    100. liblapack 3.9.0 25_linux64_openblas conda-forge
    101. libnghttp2 1.51.0 hdcd2b5c_0 conda-forge
    102. libnsl 2.0.1 hd590300_0 conda-forge
    103. libopenblas 0.3.28 pthreads_h94d23a6_0 conda-forge
    104. libpng 1.6.43 h2797004_0 conda-forge
    105. libsanitizer 14.2.0 h2a3dede_1 conda-forge
    106. libsqlite 3.46.0 hde9e2c9_0 conda-forge
    107. libssh2 1.10.0 haa6b8db_3 conda-forge
    108. libstdcxx 14.2.0 hc0a3c3a_1 conda-forge
    109. libstdcxx-devel_linux-64 14.2.0 h41c2201_101 conda-forge
    110. libstdcxx-ng 14.2.0 h4852527_1 conda-forge
    111. libtiff 4.2.0 hf544144_3 conda-forge
    112. libuuid 1.0.3 h7f8727e_2
    113. libwebp-base 1.4.0 hd590300_0 conda-forge
    114. libxcb 1.17.0 h8a09558_0 conda-forge
    115. libxcrypt 4.4.36 hd590300_1 conda-forge
    116. libxml2 2.9.14 h74e7548_0
    117. libzlib 1.2.13 h4ab18f5_6 conda-forge
    118. llvm-openmp 8.0.1 hc9558a2_0 conda-forge
    119. mafft 7.221 0 bioconda
    120. make 4.4.1 hb9d3cd8_2 conda-forge
    121. matplotlib 3.3.4 py36h5fab9bb_0 conda-forge
    122. matplotlib-base 3.3.4 py36hd391965_0 conda-forge
    123. mummer4 4.0.0rc1 pl5321hdbdd923_7 bioconda
    124. muscle 3.8.1551 h7d875b9_6 bioconda
    125. mvicuna 1.0 h4ac6f70_10 bioconda
    126. ncurses 6.5 he02047a_1 conda-forge
    127. numpy 1.19.5 py36hfc0c790_2 conda-forge
    128. olefile 0.46 pyh9f0ad1d_1 conda-forge
    129. openjdk 8.0.412 hd590300_1 conda-forge
    130. openjpeg 2.4.0 hb52868f_1 conda-forge
    131. openmp 8.0.1 0 conda-forge
    132. openssl 1.1.1w hd590300_0 conda-forge
    133. pandas 1.1.5 py36h284efc9_0 conda-forge
    134. pango 1.42.4 h7062337_4 conda-forge
    135. parallel 20240922 ha770c72_0 conda-forge
    136. pcre 8.45 h9c3ff4c_0 conda-forge
    137. perl 5.32.1 7_hd590300_perl5 conda-forge
    138. picard 3.0.0 hdfd78af_0 bioconda
    139. pigz 2.6 h27cfd23_0
    140. pillow 8.2.0 py36ha6010c0_1 conda-forge
    141. pip 21.3.1 pyhd8ed1ab_0 conda-forge
    142. pixman 0.38.0 h516909a_1003 conda-forge
    143. prinseq 0.20.4 hdfd78af_5 bioconda
    144. pthread-stubs 0.4 hb9d3cd8_1002 conda-forge
    145. pybedtools 0.9.0 py36h7281c5b_1 bioconda
    146. pyparsing 3.1.4 pyhd8ed1ab_0 conda-forge
    147. pyqt 5.9.2 py36hcca6a23_4 conda-forge
    148. pysam 0.16.0 py36h873a209_0 bioconda
    149. python 3.6.7 h381d211_1004 conda-forge
    150. python-dateutil 2.8.2 pyhd8ed1ab_0 conda-forge
    151. python_abi 3.6 2_cp36m conda-forge
    152. pytz 2023.3.post1 pyhd8ed1ab_0 conda-forge
    153. pyyaml 5.4.1 py36h8f6f2f9_1 conda-forge
    154. qt 5.9.7 h52cfd70_2 conda-forge
    155. r-assertthat 0.2.1 r36h6115d3f_2 conda-forge
    156. r-backports 1.2.1 r36hcfec24a_0 conda-forge
    157. r-base 3.6.1 h9bb98a2_1
    158. r-bitops 1.0_7 r36hcfec24a_0 conda-forge
    159. r-brio 1.1.2 r36hcfec24a_0 conda-forge
    160. r-callr 3.7.0 r36hc72bb7e_0 conda-forge
    161. r-catools 1.18.2 r36h03ef668_0 conda-forge
    162. r-cli 2.5.0 r36hc72bb7e_0 conda-forge
    163. r-colorspace 2.0_1 r36hcfec24a_0 conda-forge
    164. r-crayon 1.4.1 r36hc72bb7e_0 conda-forge
    165. r-desc 1.3.0 r36hc72bb7e_0 conda-forge
    166. r-diffobj 0.3.4 r36hcfec24a_0 conda-forge
    167. r-digest 0.6.27 r36h03ef668_0 conda-forge
    168. r-ellipsis 0.3.2 r36hcfec24a_0 conda-forge
    169. r-evaluate 0.14 r36h6115d3f_2 conda-forge
    170. r-fansi 0.4.2 r36hcfec24a_0 conda-forge
    171. r-farver 2.1.0 r36h03ef668_0 conda-forge
    172. r-ggplot2 3.3.3 r36hc72bb7e_0 conda-forge
    173. r-glue 1.4.2 r36hcfec24a_0 conda-forge
    174. r-gplots 3.1.1 r36hc72bb7e_0 conda-forge
    175. r-gsalib 2.1 r36_1002 conda-forge
    176. r-gtable 0.3.0 r36h6115d3f_3 conda-forge
    177. r-gtools 3.8.2 r36hcdcec82_1 conda-forge
    178. r-isoband 0.2.4 r36h03ef668_0 conda-forge
    179. r-jsonlite 1.7.2 r36hcfec24a_0 conda-forge
    180. r-kernsmooth 2.23_20 r36h742201e_0 conda-forge
    181. r-labeling 0.4.2 r36h142f84f_0 conda-forge
    182. r-lattice 0.20_44 r36hcfec24a_0 conda-forge
    183. r-lifecycle 1.0.0 r36hc72bb7e_0 conda-forge
    184. r-magrittr 2.0.1 r36hcfec24a_1 conda-forge
    185. r-mass 7.3_54 r36hcfec24a_0 conda-forge
    186. r-matrix 1.3_3 r36he454529_0 conda-forge
    187. r-mgcv 1.8_35 r36he454529_0 conda-forge
    188. r-munsell 0.5.0 r36h6115d3f_1003 conda-forge
    189. r-nlme 3.1_152 r36h859d828_0 conda-forge
    190. r-pillar 1.6.1 r36hc72bb7e_0 conda-forge
    191. r-pkgconfig 2.0.3 r36h6115d3f_1 conda-forge
    192. r-pkgload 1.2.1 r36h03ef668_0 conda-forge
    193. r-plyr 1.8.6 r36h0357c0b_1 conda-forge
    194. r-praise 1.0.0 r36h6115d3f_1004 conda-forge
    195. r-processx 3.5.2 r36hcfec24a_0 conda-forge
    196. r-ps 1.6.0 r36hcfec24a_0 conda-forge
    197. r-r6 2.5.0 r36hc72bb7e_0 conda-forge
    198. r-rcolorbrewer 1.1_2 r36h6115d3f_1003 conda-forge
    199. r-rcpp 1.0.6 r36h03ef668_0 conda-forge
    200. r-rematch2 2.1.2 r36h6115d3f_1 conda-forge
    201. r-reshape 0.8.8 r36hcdcec82_2 conda-forge
    202. r-rlang 0.4.11 r36hcfec24a_0 conda-forge
    203. r-rprojroot 2.0.2 r36hc72bb7e_0 conda-forge
    204. r-rstudioapi 0.13 r36hc72bb7e_0 conda-forge
    205. r-scales 1.1.1 r36h6115d3f_0 conda-forge
    206. r-testthat 3.0.2 r36h03ef668_0 conda-forge
    207. r-tibble 3.1.2 r36hcfec24a_0 conda-forge
    208. r-utf8 1.2.1 r36hcfec24a_0 conda-forge
    209. r-vctrs 0.3.8 r36hcfec24a_1 conda-forge
    210. r-viridislite 0.4.0 r36hc72bb7e_0 conda-forge
    211. r-waldo 0.2.5 r36hc72bb7e_0 conda-forge
    212. r-withr 2.4.2 r36hc72bb7e_0 conda-forge
    213. readline 7.0 hf8c457e_1001 conda-forge
    214. salmon 0.14.2 ha0cc327_0 bioconda
    215. samtools 1.6 h244ad75_5 bioconda
    216. setuptools 58.0.4 py36h5fab9bb_2 conda-forge
    217. sip 4.19.8 py36hf484d3e_1000 conda-forge
    218. six 1.16.0 pyh6c4a22f_0 conda-forge
    219. snpeff 4.1l hdfd78af_8 bioconda
    220. spades 3.15.5 h95f258a_1 bioconda
    221. sqlite 3.28.0 h8b20d00_0 conda-forge
    222. srprism 2.4.24 h6a68c12_5 bioconda
    223. sysroot_linux-64 2.17 h4a8ded7_18 conda-forge
    224. tbb 2020.3 hfd86e86_0
    225. tbl2asn 25.7 h9ee0642_1 bioconda
    226. tk 8.6.13 noxft_h4845f30_101 conda-forge
    227. tktable 2.10 h8bc8fbc_6 conda-forge
    228. tornado 6.1 py36h8f6f2f9_1 conda-forge
    229. trimmomatic 0.39 hdfd78af_2 bioconda
    230. trinity 2.8.5 h8b12597_5 bioconda
    231. tzdata 2024b hc8b5060_0 conda-forge
    232. unzip 6.0 h611a1e1_0
    233. vphaser2 2.0 h7a259b3_14 bioconda
    234. wheel 0.37.1 pyhd8ed1ab_0 conda-forge
    235. xorg-libice 1.0.10 h7f98852_0 conda-forge
    236. xorg-libsm 1.2.2 h470a237_5 conda-forge
    237. xorg-libx11 1.8.10 h4f16b4b_0 conda-forge
    238. xorg-libxau 1.0.11 hb9d3cd8_1 conda-forge
    239. xorg-libxdmcp 1.1.5 hb9d3cd8_0 conda-forge
    240. xorg-libxext 1.3.6 hb9d3cd8_0 conda-forge
    241. xorg-libxfixes 6.0.1 hb9d3cd8_0 conda-forge
    242. xorg-libxi 1.8.2 hb9d3cd8_0 conda-forge
    243. xorg-libxrender 0.9.11 hb9d3cd8_1 conda-forge
    244. xorg-libxtst 1.2.5 hb9d3cd8_3 conda-forge
    245. xorg-xorgproto 2024.1 hb9d3cd8_1 conda-forge
    246. xz 5.2.6 h166bdaf_0 conda-forge
    247. yaml 0.2.5 h7f98852_2 conda-forge
    248. zlib 1.2.13 h4ab18f5_6 conda-forge
    249. zstd 1.5.6 ha6fb4c9_0 conda-forge
  27. commands of viral-ngs

    1. bin/interhost.py
    2. Enter a subcommand to view additional information:
    3. interhost.py snpEff [...]
    4. Annotate variants in VCF file with translation consequences
    5. using snpEff.
    6. interhost.py align_mafft [...]
    7. Run the mafft alignment on the input FASTA file.
    8. interhost.py multichr_mafft [...]
    9. Run the mafft alignment on a series of chromosomes provided
    10. in sample-partitioned FASTA files. Output as FASTA. (i.e.
    11. file1.fasta would contain chr1, chr2, chr3; file2.fasta
    12. would also contain chr1, chr2, chr3)
    13. bin/ncbi.py
    14. Enter a subcommand to view additional information:
    15. ncbi.py tbl_transfer [...]
    16. This function takes an NCBI TBL file describing features on
    17. a genome(genes, etc) and transfers them to a new genome.
    18. ncbi.py tbl_transfer_prealigned [...]
    19. This breaks out the ref and alt sequences into separate
    20. fasta files, and thencreates unified files containing the
    21. reference sequence first and the alt second. Each of these
    22. unified filesis then passed as a cmap to
    23. tbl_transfer_common. This function expects to receive one
    24. fasta file containing a multialignment of a single
    25. segment/chromosome alongwith the respective reference
    26. sequence for that segment/chromosome. It also expects a
    27. reference containing allreference segments/chromosomes, so
    28. that the reference sequence can be identified in the input
    29. file by name. Italso expects a list of reference tbl files,
    30. where each file is named according to the ID present for
    31. itscorresponding sequence in the refFasta. For each non-
    32. reference sequence present in the inputFasta, two files
    33. arewritten: a fasta containing the segment/chromosome for
    34. the same, along with its corresponding feature table
    35. ascreated by tbl_transfer_common.
    36. ncbi.py fetch_fastas [...]
    37. This function downloads and saves the FASTA filesfrom the
    38. Genbank CoreNucleotide database given a given list of
    39. accession IDs.
    40. ncbi.py fetch_feature_tables [...]
    41. This function downloads and savesfeature tables from the
    42. Genbank CoreNucleotide database given a given list of
    43. accession IDs.
    44. ncbi.py fetch_genbank_records [...]
    45. This function downloads and savesfull flat text records from
    46. Genbank CoreNucleotide database given a given list of
    47. accession IDs.
    48. ncbi.py prep_genbank_files [...]
    49. Prepare genbank submission files. Requires .fasta and .tbl
    50. files as input,as well as numerous other metadata files for
    51. the submission. Creates adirectory full of files (.sqn in
    52. particular) that can be sent to GenBank.
    53. ncbi.py prep_sra_table [...]
    54. This is a very lazy hack that creates a basic table that can
    55. bepasted into various columns of an SRA submission
    56. spreadsheet. It probablydoesn't work in all cases.
  28. ~/Scripts/check_sequence_differences.py

    1. #!/usr/bin/env python3
    2. from Bio import AlignIO
    3. import sys
    4. # Check if correct arguments are provided
    5. if len(sys.argv) != 2:
    6. print("Usage: python check_sequence_differences.py <input_clustal>")
    7. sys.exit(1)
    8. # Get the input file name from the command-line arguments
    9. input_file = sys.argv[1]
    10. # Read the alignment from the input CLUSTAL file
    11. alignment = AlignIO.read(input_file, "clustal")
    12. # Extract the sequences for easy comparison
    13. seq_op = alignment[0].seq
    14. seq_hsv1 = alignment[1].seq
    15. seq_hsv_klinik = alignment[2].seq
    16. # Initialize a list to store positions with differences
    17. differences = []
    18. # Iterate over each position in the alignment
    19. for i in range(len(seq_op)):
    20. op_base = seq_op[i]
    21. hsv1_base = seq_hsv1[i]
    22. hsv_klinik_base = seq_hsv_klinik[i]
    23. # Compare the sequences at the current position
    24. if op_base != hsv1_base or op_base != hsv_klinik_base or hsv1_base != hsv_klinik_base:
    25. differences.append((i + 1, op_base, hsv1_base, hsv_klinik_base))
    26. # Print the differences
    27. if differences:
    28. print("Differences found at the following positions:")
    29. for diff in differences:
    30. pos, op_base, hsv1_base, hsv_klinik_base = diff
    31. print(f"Position {pos}: OP297860.1 = {op_base}, HSV1_S1-1 = {hsv1_base}, HSV-Klinik_S2-1 = {hsv_klinik_base}")
    32. else:
    33. print("No differences found between the sequences.")
  29. ~/Scripts/summarize_snippy_res.py

    1. import pandas as pd
    2. import glob
    3. import argparse
    4. import os
    5. #python3 summarize_snps_indels.py snippy_HDRNA_01/snippy
    6. #The following record for 2365295 is wrong, since I am sure in the HDRNA_01_K010, it should be a 'G', since in HDRNA_01_K010.csv
    7. #CP133676,2365295,snp,A,G,G:178 A:0
    8. #
    9. #The current output is as follows:
    10. #CP133676,2365295,A,snp,A,A,A,A,A,A,A,A,A,A,None,,,,,,None,None
    11. #CP133676,2365295,A,snp,A,A,A,A,A,A,A,A,A,A,nan,,,,,,nan,nan
    12. #grep -v "None,,,,,,None,None" summary_snps_indels.csv > summary_snps_indels_.csv
    13. #BUG: CP133676,2365295,A,snp,A,A,A,A,A,A,A,A,A,A,nan,,,,,,nan,nan
    14. import pandas as pd
    15. import glob
    16. import argparse
    17. import os
    18. def main(base_directory):
    19. # List of isolate identifiers
    20. isolates = ["HSV1_S1", "HSV-Klinik_S2"]
    21. expected_columns = ["CHROM", "POS", "REF", "ALT", "TYPE", "EFFECT", "LOCUS_TAG", "GENE", "PRODUCT"]
    22. # Find all CSV files in the directory and its subdirectories
    23. csv_files = glob.glob(os.path.join(base_directory, '**', '*.csv'), recursive=True)
    24. # Create an empty DataFrame to store the summary
    25. summary_df = pd.DataFrame()
    26. # Iterate over each CSV file
    27. for file_path in csv_files:
    28. # Extract isolate identifier from the file name
    29. isolate = os.path.basename(file_path).replace('.csv', '')
    30. df = pd.read_csv(file_path)
    31. # Ensure all expected columns are present, adding missing ones as empty columns
    32. for col in expected_columns:
    33. if col not in df.columns:
    34. df[col] = None
    35. # Extract relevant columns
    36. df = df[expected_columns]
    37. # Ensure consistent data types
    38. df = df.astype({"CHROM": str, "POS": int, "REF": str, "ALT": str, "TYPE": str, "EFFECT": str, "LOCUS_TAG": str, "GENE": str, "PRODUCT": str})
    39. # Add the isolate column with the ALT allele
    40. df[isolate] = df["ALT"]
    41. # Drop columns that are not needed for the summary
    42. df = df.drop(["ALT"], axis=1)
    43. if summary_df.empty:
    44. summary_df = df
    45. else:
    46. summary_df = pd.merge(summary_df, df, on=["CHROM", "POS", "REF", "TYPE", "EFFECT", "LOCUS_TAG", "GENE", "PRODUCT"], how="outer")
    47. # Fill missing values with the REF allele for each isolate column
    48. for isolate in isolates:
    49. if isolate in summary_df.columns:
    50. summary_df[isolate] = summary_df[isolate].fillna(summary_df["REF"])
    51. else:
    52. summary_df[isolate] = summary_df["REF"]
    53. # Rename columns to match the required format
    54. summary_df = summary_df.rename(columns={
    55. "CHROM": "CHROM",
    56. "POS": "POS",
    57. "REF": "REF",
    58. "TYPE": "TYPE",
    59. "EFFECT": "Effect",
    60. "LOCUS_TAG": "Gene_name",
    61. "GENE": "Biotype",
    62. "PRODUCT": "Product"
    63. })
    64. # Replace any remaining None or NaN values in the non-isolate columns with empty strings
    65. summary_df = summary_df.fillna("")
    66. # Add empty columns for Impact, Functional_Class, Codon_change, Protein_and_nucleotide_change, Amino_Acid_Length
    67. summary_df["Impact"] = ""
    68. summary_df["Functional_Class"] = ""
    69. summary_df["Codon_change"] = ""
    70. summary_df["Protein_and_nucleotide_change"] = ""
    71. summary_df["Amino_Acid_Length"] = ""
    72. # Reorder columns
    73. cols = ["CHROM", "POS", "REF", "TYPE"] + isolates + ["Effect", "Impact", "Functional_Class", "Codon_change", "Protein_and_nucleotide_change", "Amino_Acid_Length", "Gene_name", "Biotype"]
    74. summary_df = summary_df.reindex(columns=cols)
    75. # Remove duplicate rows
    76. summary_df = summary_df.drop_duplicates()
    77. # Save the summary DataFrame to a CSV file
    78. output_file = os.path.join(base_directory, "summary_snps_indels.csv")
    79. summary_df.to_csv(output_file, index=False)
    80. print("Summary CSV file created successfully at:", output_file)
    81. if __name__ == "__main__":
    82. parser = argparse.ArgumentParser(description="Summarize SNPs and Indels from CSV files.")
    83. parser.add_argument("directory", type=str, help="Base directory containing the CSV files in subdirectories.")
    84. args = parser.parse_args()
    85. main(args.directory)
  30. ~/Scripts/merge_snps_indels.py

    1. import pandas as pd
    2. import argparse
    3. import os
    4. def merge_files(input_file1, input_file2, output_file):
    5. # Read the input files
    6. df1 = pd.read_csv(input_file1)
    7. df2 = pd.read_csv(input_file2, sep='\t')
    8. # Merge the dataframes on the 'POS' column, keeping only the rows that have common 'POS' values
    9. merged_df = pd.merge(df2, df1[['POS']], on='POS', how='inner')
    10. # Remove duplicate rows
    11. merged_df.drop_duplicates(inplace=True)
    12. # Save the merged dataframe to the output file
    13. merged_df.to_csv(output_file, index=False)
    14. print("Merged file created successfully at:", output_file)
    15. if __name__ == "__main__":
    16. parser = argparse.ArgumentParser(description="Merge two SNP and Indel files based on the 'POS' column.")
    17. parser.add_argument("input_file1", type=str, help="Path to the first input file (summary_snps_indels.csv).")
    18. parser.add_argument("input_file2", type=str, help="Path to the second input file (All_SNPs_indels_annotated.txt).")
    19. parser.add_argument("output_file", type=str, help="Path to the output file.")
    20. args = parser.parse_args()
    21. merge_files(args.input_file1, args.input_file2, args.output_file)
  31. ~/Scripts/convert_fasta_to_clustal.py

    1. #!/usr/bin/env python3
    2. from Bio import AlignIO
    3. import sys
    4. # Check if the correct number of arguments is provided
    5. if len(sys.argv) != 3:
    6. print("Usage: python convert_fasta_to_clustal.py <input_fasta> <output_clustal>")
    7. sys.exit(1)
    8. # Get the input and output file names from command-line arguments
    9. input_file = sys.argv[1]
    10. output_file = sys.argv[2]
    11. # Read the input FASTA file
    12. alignment = AlignIO.read(input_file, "fasta")
    13. # Write the alignment to the output CLUSTAL file
    14. with open(output_file, "w") as out_file:
    15. AlignIO.write(alignment, out_file, "clustal")
    16. print(f"Conversion complete! The CLUSTAL file is saved as {output_file}.")
  32. ~/Scripts/convert_clustal_to_clustal.py

    1. #!/usr/bin/env python3
    2. from Bio import AlignIO
    3. import sys
    4. # Check if correct arguments are provided
    5. if len(sys.argv) != 3:
    6. print("Usage: python convert_clustal_to_fasta.py <input_clustal> <output_clustal>")
    7. sys.exit(1)
    8. # Get the input and output file names from command-line arguments
    9. input_file = sys.argv[1]
    10. output_file = sys.argv[2]
    11. # Read the CLUSTAL alignment
    12. alignment = AlignIO.read(input_file, "clustal")
    13. # Extract sequences (assuming three sequences)
    14. op_seq = alignment[0].seq
    15. hsv1_seq = alignment[1].seq
    16. hsv_klinik_seq = alignment[2].seq
    17. # Make sure the sequences have the same length
    18. if len(op_seq) != len(hsv1_seq) or len(op_seq) != len(hsv_klinik_seq):
    19. print("Error: Sequences have different lengths!")
    20. sys.exit(1)
    21. # Prepare new sequences for HSV1 and HSV-Klinik
    22. new_hsv1_seq = []
    23. new_hsv_klinik_seq = []
    24. # Iterate through each position of the sequences
    25. for i in range(len(op_seq)):
    26. op_base = op_seq[i]
    27. hsv1_base = hsv1_seq[i]
    28. hsv_klinik_base = hsv_klinik_seq[i]
    29. # Apply the rules for replacing bases in HSV1_S1-1 and HSV-Klinik_S2-1
    30. if hsv1_base in ['N', '-']:
    31. # Replace with OP297860.1 base
    32. new_hsv1_seq.append(op_base)
    33. else:
    34. # Otherwise, keep the original base
    35. new_hsv1_seq.append(hsv1_base)
    36. if hsv_klinik_base in ['N', '-']:
    37. # Replace with OP297860.1 base
    38. new_hsv_klinik_seq.append(op_base)
    39. else:
    40. # Otherwise, keep the original base
    41. new_hsv_klinik_seq.append(hsv_klinik_base)
    42. # Update the sequences in the alignment
    43. alignment[1].seq = "".join(new_hsv1_seq)
    44. alignment[2].seq = "".join(new_hsv_klinik_seq)
    45. # Write the modified alignment back to a file in CLUSTAL format
    46. with open(output_file, "w") as out_file:
    47. AlignIO.write(alignment, out_file, "clustal")
    48. print(f"Conversion complete! The modified CLUSTAL file is saved as {output_file}.")
  33. ~/Scripts/convert_clustal_to_fasta.py

    1. #!/usr/bin/env python3
    2. from Bio import AlignIO
    3. import sys
    4. # Check if the correct number of arguments is provided
    5. if len(sys.argv) != 3:
    6. print("Usage: python convert_clustal_to_fasta.py <input_clustal> <output_fasta>")
    7. sys.exit(1)
    8. # Get the input and output file names from command-line arguments
    9. input_file = sys.argv[1]
    10. output_file = sys.argv[2]
    11. # Read the input CLUSTAL file
    12. alignment = AlignIO.read(input_file, "clustal")
    13. # Write the alignment to the output FASTA file
    14. with open(output_file, "w") as out_file:
    15. AlignIO.write(alignment, out_file, "fasta")
    16. print(f"Conversion complete! The FASTA file is saved as {output_file}.")
  34. ~/Scripts/filter_isnv.py

    1. #!/usr/bin/env python3
    2. import sys
    3. import pandas as pd
    4. # Check for correct command-line arguments
    5. if len(sys.argv) != 3:
    6. print("Usage: python filter_isnv.py <input_file> <min_freq>")
    7. sys.exit(1)
    8. input_file = sys.argv[1]
    9. min_freq = float(sys.argv[2])
    10. # Load the data into a pandas DataFrame
    11. data = pd.read_csv(input_file, sep='\t')
    12. # Filter out records where all records at the same position have iSNV_freq < min_freq
    13. def filter_isnv(data, min_freq):
    14. # Group data by 'chr' and 'pos' to check records at each position
    15. grouped = data.groupby(['chr', 'pos'])
    16. # Keep groups where at least one record has iSNV_freq >= min_freq
    17. filtered_data = grouped.filter(lambda x: any(x['iSNV_freq'] >= min_freq))
    18. return filtered_data
    19. # Apply the filter
    20. filtered_data = filter_isnv(data, min_freq)
    21. # Output the filtered data
    22. output_file = "filtered_" + input_file
    23. filtered_data.to_csv(output_file, sep='\t', index=False)
    24. print(f"Filtered data saved to {output_file}")

like unlike

点赞本文的读者

还没有人对此文章表态


本文有评论

没有评论

看文章,发评论,不要沉默


© 2023 XGenes.com Impressum