Structural Variant Calling for Nanopore Sequencing (edited)

gene_x 0 like s 398 view s

Tags: variation, pipeline

  1. install mambaforge https://conda-forge.org/miniforge/ (recommended)
    1. #download Mambaforge-24.9.2-0-Linux-x86_64.sh from website
    2. chmod +x Mambaforge-24.9.2-0-Linux-x86_64.sh
    3. ./Mambaforge-24.9.2-0-Linux-x86_64.sh
    4. To activate this environment, use:
    5. micromamba activate /home/jhuang/mambaforge
    6. Or to execute a single command in this environment, use:
    7. micromamba run -p /home/jhuang/mambaforge mycommand
    8. installation finished.
    9. Do you wish to update your shell profile to automatically initialize conda?
    10. This will activate conda on startup and change the command prompt when activated.
    11. If you'd prefer that conda's base environment not be activated on startup,
    12. run the following command when conda is activated:
    13. conda config --set auto_activate_base false
    14. You can undo this by running `conda init --reverse $SHELL`? [yes|no]
    15. [no] >>> yes
    16. no change /home/jhuang/mambaforge/condabin/conda
    17. no change /home/jhuang/mambaforge/bin/conda
    18. no change /home/jhuang/mambaforge/bin/conda-env
    19. no change /home/jhuang/mambaforge/bin/activate
    20. no change /home/jhuang/mambaforge/bin/deactivate
    21. no change /home/jhuang/mambaforge/etc/profile.d/conda.sh
    22. no change /home/jhuang/mambaforge/etc/fish/conf.d/conda.fish
    23. no change /home/jhuang/mambaforge/shell/condabin/Conda.psm1
    24. no change /home/jhuang/mambaforge/shell/condabin/conda-hook.ps1
    25. no change /home/jhuang/mambaforge/lib/python3.12/site-packages/xontrib/conda.xsh
    26. no change /home/jhuang/mambaforge/etc/profile.d/conda.csh
    27. modified /home/jhuang/.bashrc
    28. ==> For changes to take effect, close and re-open your current shell. <==
    29. no change /home/jhuang/mambaforge/condabin/conda
    30. no change /home/jhuang/mambaforge/bin/conda
    31. no change /home/jhuang/mambaforge/bin/conda-env
    32. no change /home/jhuang/mambaforge/bin/activate
    33. no change /home/jhuang/mambaforge/bin/deactivate
    34. no change /home/jhuang/mambaforge/etc/profile.d/conda.sh
    35. no change /home/jhuang/mambaforge/etc/fish/conf.d/conda.fish
    36. no change /home/jhuang/mambaforge/shell/condabin/Conda.psm1
    37. no change /home/jhuang/mambaforge/shell/condabin/conda-hook.ps1
    38. no change /home/jhuang/mambaforge/lib/python3.12/site-packages/xontrib/conda.xsh
    39. no change /home/jhuang/mambaforge/etc/profile.d/conda.csh
    40. no change /home/jhuang/.bashrc
    41. No action taken.
    42. WARNING conda.common.path.windows:_path_to(100): cygpath is not available, fallback to manual path conversion
    43. WARNING conda.common.path.windows:_path_to(100): cygpath is not available, fallback to manual path conversion
    44. Added mamba to /home/jhuang/.bashrc
    45. ==> For changes to take effect, close and re-open your current shell. <==
    46. Thank you for installing Mambaforge!
    47. Close your terminal window and open a new one, or run:
    48. #source ~/mambaforge/bin/activate
    49. conda --version
    50. mamba --version
    51. https://github.com/conda-forge/miniforge/releases
    52. Note
    53. * After installation, please make sure that you do not have the Anaconda default channels configured.
    54. conda config --show channels
    55. conda config --remove channels defaults
    56. conda config --add channels conda-forge
    57. conda config --show channels
    58. conda config --set channel_priority strict
    59. #conda clean --all
    60. conda config --remove channels biobakery
    61. * !!!!Do not install anything into the base environment as this might break your installation. See here for details.!!!!
    62. # --Deprecated method: mamba installing on conda--
    63. #conda install -n base --override-channels -c conda-forge mamba 'python_abi=*=*cp*'
    64. # * Note that installing mamba into any other environment than base is not supported.
    65. #
    66. #conda activate base
    67. #conda install conda
    68. #conda uninstall mamba
    69. #conda install mamba

2: install required Tools on the mamba env

  1. * Sniffles2: Detect structural variants, including transposons, from long-read alignments.
  2. * RepeatModeler2: Identify and classify transposons de novo.
  3. * RepeatMasker: Annotate known transposable elements using transposon libraries.
  4. * SVIM: An alternative structural variant caller optimized for long-read sequencing, if needed.
  5. * SURVIVOR: Consolidate structural variants across samples for comparative analysis.
  6. mamba deactivate
  7. # Create a new conda environment
  8. mamba create -n transposon_long python=3.6 -y
  9. # Activate the environment
  10. mamba activate transposon_long
  11. mamba install -c bioconda sniffles
  12. mamba install -c bioconda repeatmodeler repeatmasker
  13. # configure repeatmasker database
  14. mamba info --envs
  15. cd /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker
  16. #mamba install python=3.6
  17. mamba install -c bioconda svim
  18. mamba install -c bioconda survivor
  1. Test the installed tools

    1. # Check versions
    2. sniffles --version
    3. RepeatModeler -h
    4. RepeatMasker -h
    5. svim --help
    6. SURVIVOR --help
    7. mamba install -c conda-forge perl r
  2. Data Preparation

    1. Raw Signal Data: Nanopore devices generate electrical signal data as DNA passes through the nanopore.
    2. Basecalling: Tools like Guppy or Dorado are used to convert raw signals into nucleotide sequences (FASTQ files).
  3. Preprocessing

    1. Quality Filtering: Remove low-quality reads using tools like Filtlong or NanoFilt.
    2. Adapter Trimming: Identify and remove sequencing adapters with tools like Porechop.
  4. (Optional) Variant Calling for SNP and Indel Detection:

    1. Tools like Medaka, Longshot, or Nanopolish analyze the aligned reads to identify SNPs and small indels.
  5. Alignment and Structural Variant Calling: Tools such as Sniffles or SVIM detect large insertions, deletions, and other structural variants. 使用长读长测序工具如 SVIM 或 Sniffles 检测结构变异(e.g. 散在性重复序列)。

    1. #NOTE that the ./batch1_depth25/trycycler_WT/reads.fastq and F24A430001437_BACctmoD/BGI_result/Separate/${sample}/1.Cleandata/${sample}.filtered_reads.fq.gz are the same!
    2. ./4/1.Cleandata/4.filtered_reads.fq.gz
    3. ./3/1.Cleandata/3.filtered_reads.fq.gz
    4. ./2/1.Cleandata/2.filtered_reads.fq.gz
    5. ./8/1.Cleandata/8.filtered_reads.fq.gz
    6. ./5/1.Cleandata/5.filtered_reads.fq.gz
    7. ./WT/1.Cleandata/WT.filtered_reads.fq.gz
    8. ./9/1.Cleandata/9.filtered_reads.fq.gz
    9. ./10/1.Cleandata/10.filtered_reads.fq.gz
    10. ./7/1.Cleandata/7.filtered_reads.fq.gz
    11. ./1/1.Cleandata/1.filtered_reads.fq.gz
    12. # -- Alignment and Detect structural variants in each sample using SVIM (failed due to the strange output from SVIM!)
    13. #mamba install -c bioconda ngmlr
    14. mamba install -c bioconda svim
    15. for sample in WT 1 2 3 4 5 7 8 9 10; do
    16. svim reads --aligner ngmlr --nanopore svim_reads_ngmlr_${sample} F24A430001437_BACctmoD/BGI_result/Separate/${sample}/1.Cleandata/${sample}.filtered_reads.fq.gz CP020463.fasta --cores 10;
    17. done
    18. for sample in WT 1 2 3 4 5 7 8 9 10; do
    19. for sample in 1; do
    20. #INS,INV,DUP:TANDEM,DUP:INT,BND
    21. svim reads svim_reads_minimap2_${sample} F24A430001437_BACctmoD/BGI_result/Separate/${sample}/1.Cleandata/${sample}.filtered_reads.fq.gz CP020463.fasta --aligner minimap2 --nanopore --cores 20 --types INS --min_sv_size 100 --sequence_allele --insertion_sequences --read_names;
    22. done
    23. #svim alignment svim_alignment_minmap2_1_re 1.sorted.bam CP020463_.fasta --types INS --sequence_alleles --insertion_sequences --read_names
    24. # -- Results1: Detect structural variants using Minamap2+Sniffles2:
    25. Minimap2: A commonly used aligner for nanopore sequencing data.
    26. Align Long Reads to the WT Reference using Minimap2
    27. for sample in WT 1 2 3 4 5 7 8 9 10; do
    28. minimap2 --MD -t 60 -ax map-ont CP020463.fasta ./batch1_depth25/trycycler_${sample}/reads.fastq | samtools sort -o ${sample}.sorted.bam
    29. samtools index ${sample}.sorted.bam
    30. done
    31. #sniffles -m WT.sorted.bam -v WT.vcf -s 10 -l 50 -t 60
    32. # -s 20: Requires at least 20 reads to support an SV for reporting. --> 10
    33. # -l 50: Reports SVs that are at least 50 base pairs long.
    34. # -t 4: Uses 4 threads for faster processing. --> 60
    35. for sample in WT 1 2 3 4 5 7 8 9 10; do
    36. sniffles -m ${sample}.sorted.bam -v ${sample}.vcf -s 10 -l 50 -t 60
    37. #QUAL < 20 ||
    38. bcftools filter -e "INFO/SVTYPE != 'INS'" ${sample}.vcf > ${sample}_filtered.vcf
    39. done
    40. # -- Results2: Detect structural variants using NGMLR+Sniffles2
    41. for sample in WT 1 2 3 4 5 7 8 9 10; do
    42. #ERROR: No MD string detected! Check bam file! Otherwise generate using e.g. samtools. --> No results!
    43. #sniffles -m svim_reads_minimap2_${sample}/${sample}.filtered_reads.fq.minimap2.coordsorted.bam -v #sniffles_minimap2_${sample}.vcf -s 10 -l 50 -t 60
    44. bcftools filter -e "INFO/SVTYPE != 'INS'" sniffles_minimap2_${sample}.vcf > sniffles_minimap2_${sample}_filtered.vcf
    45. #Using
    46. sniffles -m svim_reads_ngmlr_${sample}/${sample}.filtered_reads.fq.ngmlr.coordsorted.bam -v sniffles_ngmlr_${sample}.vcf -s 10 -l 50 -t 60
    47. bcftools filter -e "INFO/SVTYPE != 'INS'" sniffles_ngmlr_${sample}.vcf > sniffles_ngmlr_${sample}_filtered.vcf
    48. done
    49. # -- Compare the results1 and results2, and check them each position in IGV!
    50. #minimap2+sniffles2
    51. (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 WT_filtered.vcf | grep -v "##"
    52. POS
    53. 1855752
    54. (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 1_filtered.vcf | grep -v "##"
    55. POS
    56. 529416
    57. 1855752
    58. 2422820
    59. (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 2_filtered.vcf | grep -v "##"
    60. POS
    61. 529416
    62. 1855752
    63. 2422820
    64. 2424590
    65. (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 3_filtered.vcf | grep -v "##"
    66. POS
    67. 529416
    68. 529416
    69. 529418
    70. 1855752
    71. 2422820
    72. (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 4_filtered.vcf | grep -v "##"
    73. POS
    74. 55682
    75. 529416
    76. 1855752
    77. 2422820
    78. 2424590
    79. (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 5_filtered.vcf | grep -v "##"
    80. POS
    81. 529416
    82. 1855752
    83. 2422820
    84. (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 7_filtered.vcf | grep -v "##"
    85. POS
    86. 518217
    87. 1855752
    88. 2424590
    89. (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 8_filtered.vcf | grep -v "##"
    90. POS
    91. 529416
    92. 1855752
    93. 2422820
    94. (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 9_filtered.vcf | grep -v "##"
    95. POS
    96. 529416
    97. 1855752
    98. 2422820
    99. 2424590
    100. (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 10_filtered.vcf | grep -v "##"
    101. POS
    102. 529416
    103. 1855752
    104. 2422818
    105. 2424590
    106. #ngmlr+sniffles2
    107. (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 sniffles_ngmlr_WT_filtered.vcf | grep -v "##"
    108. POS
    109. (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 sniffles_ngmlr_1_filtered.vcf | grep -v "##"
    110. POS
    111. 529419
    112. 2422819
    113. (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 sniffles_ngmlr_2_filtered.vcf | grep -v "##"
    114. POS
    115. 529418
    116. 2422820
    117. (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 sniffles_ngmlr_3_filtered.vcf | grep -v "##"
    118. POS
    119. 529418
    120. 2422820
    121. (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 sniffles_ngmlr_4_filtered.vcf | grep -v "##"
    122. POS
    123. 529419
    124. 2422820
    125. (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 sniffles_ngmlr_5_filtered.vcf | grep -v "##"
    126. POS
    127. 529419
    128. 2422820
    129. (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 sniffles_ngmlr_7_filtered.vcf | grep -v "##"
    130. POS
    131. 518219
    132. 2422820
    133. (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 sniffles_ngmlr_8_filtered.vcf | grep -v "##"
    134. POS
    135. 529419
    136. 2422820
    137. (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 sniffles_ngmlr_9_filtered.vcf | grep -v "##"
    138. POS
    139. 529419
    140. 2422820
    141. (base) jhuang@WS-2290C:/mnt/md1/DATA_md1/Data_Patricia_Transposon$ cut -d$'\t' -f2 sniffles_ngmlr_10_filtered.vcf | grep -v "##"
    142. POS
    143. 529418
    144. 2422820
    145. #~/Tools/csv2xls-0.4/csv_to_xls.py sniffles_ngmlr_WT_filtered.vcf sniffles_ngmlr_1_filtered.vcf sniffles_ngmlr_2_filtered.vcf sniffles_ngmlr_3_filtered.vcf sniffles_ngmlr_4_filtered.vcf sniffles_ngmlr_5_filtered.vcf sniffles_ngmlr_7_filtered.vcf sniffles_ngmlr_8_filtered.vcf sniffles_ngmlr_9_filtered.vcf sniffles_ngmlr_10_filtered.vcf -d$'\t' -o putative_transposons2.xls
    146. # -- Filtering low-complexity insertions using RepeatMasker (TODO: how to use RepeatModeler to generate own lib?)
    147. python vcf_to_fasta.py variants.vcf variants.fasta
    148. #python filter_low_complexity.py variants.fasta filtered_variants.fasta retained_variants.fasta
    149. #Using RepeatMasker to filter the low-complexity fasta, the used h5 lib is
    150. /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/Libraries/Dfam.h5 #1.9G
    151. python /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/famdb.py -i /home/jhuang/mambaforge/envs/transposon_long/share/RepeatMasker/Libraries/Dfam.h5 names 'bacteria' | head
    152. Exact Matches
    153. =============
    154. 2 bacteria (blast name), Bacteria <bacteria> (scientific name), eubacteria (genbank common name), Monera <bacteria> (in-part), Procaryotae <bacteria> (in-part), Prokaryota <bacteria> (in-part), Prokaryotae <bacteria> (in-part), prokaryote <bacteria> (in-part), prokaryotes <bacteria> (in-part)
    155. Non-exact Matches
    156. =================
    157. 1783272 Terrabacteria group (scientific name)
    158. 91061 Bacilli (scientific name), Bacilli Ludwig et al. 2010 (authority), Bacillus/Lactobacillus/Streptococcus group (synonym), Firmibacteria (synonym), Firmibacteria Murray 1988 (authority)
    159. 1239 Bacillaeota (synonym), Bacillaeota Oren et al. 2015 (authority), Bacillota (synonym), Bacillus/Clostridium group (synonym), clostridial firmicutes (synonym), Clostridium group firmicutes (synonym), Firmacutes (synonym), firmicutes (blast name), Firmicutes (scientific name), Firmicutes corrig. Gibbons and Murray 1978 (authority), Low G+C firmicutes (synonym), low G+C Gram-positive bacteria (common name), low GC Gram+ (common name)
    160. Summary of Classes within Firmicutes:
    161. * Bacilli (includes many common pathogenic and non-pathogenic Gram-positive bacteria, taxid=91061)
    162. * Bacillus (e.g., Bacillus subtilis, Bacillus anthracis)
    163. * Staphylococcus (e.g., Staphylococcus aureus, Staphylococcus epidermidis)
    164. * Streptococcus (e.g., Streptococcus pneumoniae, Streptococcus pyogenes)
    165. * Listeria (e.g., Listeria monocytogenes)
    166. * Clostridia (includes many anaerobic species like Clostridium and Clostridioides)
    167. * Erysipelotrichia (intestinal bacteria, some pathogenic)
    168. * Tissierellia (less-studied, veterinary relevance)
    169. * Mollicutes (cell wall-less, includes Mycoplasma species)
    170. * Negativicutes (includes some Gram-negative, anaerobic species)
    171. RepeatMasker -species Bacilli -pa 4 -xsmall variants.fasta
    172. python extract_unmasked_seq.py variants.fasta.masked unmasked_variants.fasta
    173. #bcftools filter -i 'QUAL>30 && INFO/SVLEN>100' variants.vcf -o filtered.vcf
    174. #
    175. #bcftools view -i 'SVTYPE="INS"' variants.vcf | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO\n' > insertions.txt
    176. #mamba install -c bioconda vcf2fasta
    177. #vcf2fasta variants.vcf -o insertions.fasta
    178. #grep "SEQS" variants.vcf | awk '{ print $1, $2, $4, $5, $8 }' > insertions.txt
    179. #python3 filtering_low_complexity.py
    180. #
    181. #vcftools --vcf input.vcf --recode --out filtered_output --minSVLEN 100
    182. #bcftools filter -e 'INFO/SEQS ~ "^(G+|C+|T+|A+){4,}"' variants.vcf -o filtered.vcf
    183. # -- calculate the percentage of reads
    184. To calculate the percentage of reads that contain the insertion from the VCF entry, use the INFO and FORMAT fields provided in the VCF record.
    185. Step 1: Extract Relevant Information
    186. In the provided VCF entry:
    187. RE (Reads Evidence): 733 the total number of reads supporting the insertion.
    188. GT (Genotype): 1/1 this indicates a homozygous insertion, meaning all reads covering this region are expected to have the insertion.
    189. AF (Allele Frequency): 1 a 100% allele frequency, indicating that every read in this sample supports the insertion.
    190. DR (Depth Reference): 0 the number of reads supporting the reference allele.
    191. DV (Depth Variant): 733 the number of reads supporting the variant allele (insertion).
    192. Step 2: Calculate Percentage of Reads Supporting the Insertion
    193. Using the formula:
    194. Percentage of reads with insertion=(DVDR+DV100
    195. Percentage of reads with insertion=(DR+DVDV​)×100
    196. Substitute the values:
    197. Percentage=(7330+733100=100%
    198. Percentage=(0+733733​)×100=100%
    199. Conclusion
    200. Based on the VCF record, 100% of the reads support the insertion, indicating that the insertion is fully present in the sample (homozygous insertion). This is consistent with the AF=1 and GT=1/1 fields.
  6. (failed) using own scripts direct analyze the bam-file via cigarString (failed due to too many short insertions!)

    1. transposons.fasta is a file containing the transposon sequences in FASTA format.
    2. python your_script.py input.bam reference.fasta transposons.fasta
    3. #Transposon_Sequence Insertion_Frequency
    4. #Tn5 10
    5. #Tn10 5
    6. #Unknown 3
    7. python putative_transposons_with_counts.py mapping_WT.sorted.bam CP020463.fasta
    8. rule trim_short_reads:
    9. input:
    10. "/data/short-reads.fq.gz"
    11. output:
    12. "/data/trimmed-short-reads.fasta"
    13. shell:
    14. "python3 trim_by_tag_length.py /data/short-reads.fq.gz 10 > /data/trimmed-short-reads.fasta"
    15. rule trim_long_reads:
    16. input:
    17. "/data/long-reads.fq.gz"
    18. output:
    19. "/data/trimmed-long-reads.fasta"
    20. shell:
    21. "python3 trim_by_tag_length.py /data/long-reads.fq.gz 92 > /data/trimmed-long-reads.fasta"
    22. rule install_bwa:
    23. output:
    24. "bwa-mem2-2.0pre2_x64-linux/bwa-mem2"
    25. shell:
    26. "curl -L https://github.com/bwa-mem2/bwa-mem2/releases/download/v2.0pre2/bwa-mem2-2.0pre2_x64-linux.tar.bz2 | tar jxf -"
    27. rule map_short_reads:
    28. input:
    29. "bwa-mem2-2.0pre2_x64-linux/bwa-mem2",
    30. "/data/reference.fasta",
    31. "/data/trimmed-short-reads.fasta"
    32. output:
    33. "/data/mapping.sam"
    34. shell:
    35. """
    36. bwa-mem2-2.0pre2_x64-linux/bwa-mem2 index /data/reference.fasta
    37. bwa-mem2-2.0pre2_x64-linux/bwa-mem2 mem /data/reference.fasta /data/trimmed-short-reads.fasta > /data/mapping.sam
    38. """
    39. rule map_long_reads:
    40. input:
    41. "/data/reference.fasta",
    42. "/data/trimmed-long-reads.fasta"
    43. output:
    44. "/data/mapping.bam"
    45. conda:
    46. "minimap2.yml"
    47. shell:
    48. """
    49. minimap2 -x map-ont -d reference /data/reference.fasta > /dev/null 2>&1
    50. minimap2 -c -a -o /data/mapping.nonunique.sam -N 1 -x map-ont reference /data/trimmed-long-reads.fasta
    51. samtools view -bq 1 /data/mapping.nonunique.sam > /data/mapping.bam
    52. """
    53. rule convert_sam_to_bam:
    54. input:
    55. "/data/mapping.sam"
    56. output:
    57. "/data/mapping.bam",
    58. conda:
    59. "samtools.yml"
    60. shell:
    61. "samtools view -S -b /data/mapping.sam > /data/mapping.bam"
    62. rule get_unmapped_reads:
    63. input:
    64. "/data/mapping.bam"
    65. output:
    66. "/data/mapping.sorted.bam"
    67. conda:
    68. "samtools.yml"
    69. shell:
    70. """
    71. # samtools view -f 4 /data/mapping.bam > /data/unmapped.sam
    72. # samtools view -S -b /data/unmapped.sam > /data/unmapped.bam
    73. # samtools bam2fq /data/unmapped.bam | seqtk seq -A - > /data/unmapped.fa
    74. samtools sort /data/mapping.bam -o /data/mapping.sorted.bam
    75. samtools index /data/mapping.sorted.bam
    76. """
    77. rule create_insertion_plot:
    78. input:
    79. "/data/mapping.sorted.bam"
    80. output:
    81. "/data/summary-stats.tsv"
    82. shell:
    83. """
    84. python3 ~/Scripts/sam_to_insert_plot.py /data/mapping.sorted.bam /data/reference.fasta > /data/summary-stats.tsv
    85. """
  7. Polishing of assembly: Use tools like Medaka to refine variant calls by leveraging consensus sequences derived from nanopore data.

    1. mamba install -c bioconda medaka
    2. medaka-consensus -i aligned_reads.bam -r reference.fasta -o polished_output -t 4
  8. Compare Insertions Across Samples

    1. Merge Variants Across Samples: Use SURVIVOR to merge and compare the detected insertions in all samples against the WT:
    2. SURVIVOR merge input_vcfs.txt 1000 1 1 1 0 30 merged.vcf
    3. Input: List of VCF files from Sniffles2.
    4. Output: A consolidated VCF file with shared and unique variants.
    5. Filter WT Insertions:
    6. Identify transposons present only in samples 19 by subtracting WT variants using bcftools:
    7. bcftools isec WT.vcf merged.vcf -p comparison_results
  9. Validate and Visualize

    1. Visualize with IGV: Use IGV to inspect insertion sites in the alignment and confirm quality.
    2. igv.sh
    3. Validate Findings:
    4. Perform PCR or additional sequencing for key transposon insertion sites to confirm results.
  10. Alternatives to TEPID for Long-Read Data

    1. If youre looking for transposon-specific tools for long reads:
    2. REPET: A robust transposon annotation tool compatible with assembled genomes.
    3. EDTA (Extensive de novo TE Annotator):
    4. A pipeline to identify, classify, and annotate transposons.
    5. Works directly on your assembled genomes.
    6. perl EDTA.pl --genome WT.fasta --type all
  11. The WT.vcf file in the pipeline is generated by detecting structural variants (SVs) in the wild-type (WT) genome aligned against itself or using it as a baseline reference. Here’s how you can generate the WT.vcf:

    1. Steps to Generate WT.vcf
    2. 1. Align WT Reads to the WT Reference Genome
    3. The goal here is to create an alignment of the WT sequencing data to the WT reference genome to detect any self-contained structural variations, such as native insertions, deletions, or duplications.
    4. Command using Minimap2:
    5. minimap2 -ax map-ont WT.fasta WT_reads.fastq | samtools sort -o WT.sorted.bam
    6. Index the BAM file:
    7. samtools index WT.sorted.bam
    8. 2. Detect Structural Variants with Sniffles2
    9. Run Sniffles2 on the WT alignment to call structural variants:
    10. sniffles --input WT.sorted.bam --vcf WT.vcf
    11. This step identifies:
    12. Native transposons and insertions present in the WT genome.
    13. Other structural variants that are part of the reference genome or sequencing artifacts.
    14. Key parameters to consider:
    15. --min_support: Adjust based on your WT sequencing coverage.
    16. --max_distance: Define proximity for merging variants.
    17. --min_length: Set a minimum SV size (e.g., >50 bp for transposons).
  12. Clean and Filter the WT.vcf, Variant Filtering: Remove low-confidence variants based on read depth, quality scores, or allele frequency.

    1. To ensure the WT.vcf only includes relevant transposons or SVs:
    2. Use bcftools or similar tools to filter out low-confidence variants:
    3. bcftools filter -e "QUAL < 20 || INFO/SVTYPE != 'INS'" WT.vcf > WT_filtered.vcf
    4. bcftools filter -e "QUAL < 1 || INFO/SVTYPE != 'INS'" 1_.vcf > 1_filtered_.vcf
  13. NOTE that in this pipeline, the WT.fasta (reference genome) is typically a high-quality genome sequence from a database or a well-annotated version of your species' genome. It is not assembled from the WT.fastq sequencing reads in this context. Here's why:

    1. Why Use a Reference Genome (WT.fasta) from a Database?
    2. Higher Quality and Completeness:
    3. Database references (e.g., NCBI, Ensembl) are typically well-assembled, highly polished, and annotated. They serve as a reliable baseline for variant detection.
    4. Consistency:
    5. Using a standard reference ensures consistent comparisons across your WT and samples (19). Variants detected will be relative to this reference, not influenced by possible assembly errors.
    6. Saves Time:
    7. Assembling a reference genome from WT reads requires significant computational effort. Using an existing reference streamlines the analysis.
    8. Alternative: Assembling WT from FASTQ
    9. If you dont have a high-quality reference genome (WT.fasta) and must rely on your WT FASTQ reads:
    10. Assemble the genome from your WT.fastq:
    11. Use long-read assemblers like Flye, Canu, or Shasta to create a draft genome.
    12. flye --nano-raw WT.fastq --out-dir WT_assembly --genome-size <size>
    13. Polish the assembly using tools like Racon (with the same reads) or Medaka for higher accuracy.
    14. Use the assembled and polished genome as your WT.fasta reference for further steps.
    15. Key Takeaways:
    16. If you have access to a reliable, high-quality reference genome, use it as the WT.fasta.
    17. Only assemble WT.fasta from raw reads (WT.fastq) if no database reference is available for your organism.
  14. Annotate Transposable Elements: Tools like ANNOVAR or SnpEff provide functional insights into the detected variants.

    1. # -- (successful!) MANUALLY Search for all found insertion sequences at https://tncentral.ncc.unesp.br/blast/ !
    2. # Or use the program available at https://github.com/danillo-alvarenga/tncomp_finder if there are numerous matches.
    3. #https://tncentral.ncc.unesp.br/report/te/Tn551-Y13600.1
    4. # -- (failed!) try TEPID for annotation
    5. mamba install tepid=0.10 -c bioconda
    6. #(tepid_env)
    7. for sample in WT 1 2 3 4 5 7 8 9 10; do
    8. tepid-map-se -x CP020463 -p 10 -n ${sample}_tepid -q ../batch1_depth25/trycycler_${sample}/reads.fastq;
    9. tepid-discover -k -i -p 10 -n ${sample}_tepid -c ${sample}_tepid --se;
    10. done
    11. tepid-discover -k -i -p 10 -n 1_tepid -c 1.sorted.bam --se;
    12. tepid-refine [-h] [--version] [-k] [-i INSERTIONS] [-d DELETIONS]
    13. [-p PROC] -t TE -n NAME -c CONC -s SPLIT -a AL
    14. # -- (failed!) try EDTA for annotation
    15. https://github.com/oushujun/EDTA
    16. (transposon_long) mamba install -c conda-forge -c bioconda edta
    17. mamba install bioconda::rmblast # cd RepeatMasker; ./configure
    18. EDTA.pl --genome CP020463.fasta --species others --threads 40
    19. For general-purpose TE annotation: EDTA, RepeatMasker, or RepeatScout are your best options.
    20. For de novo repeat identification: RepeatScout is highly effective.
    21. For LTR retrotransposons: Use LTR_retriever.
    22. For bacterial-specific annotations: Transposome, TEfinder, and ISfinder can be useful.
  15. Validation: Cross-validate with short-read sequencing data if available.

like unlike

点赞本文的读者

还没有人对此文章表态


本文有评论

没有评论

看文章,发评论,不要沉默


© 2023 XGenes.com Impressum