gene_x 0 like s 7 view s
Tags: pipeline
There are several alternative R packages and tools to perform motif enrichment analysis for RNA-binding proteins (RBPs), beyond PWMEnrich::motifEnrichment(). Here are the most notable ones:
| Tool / Package | Enrichment | Custom Motifs | CLI or R? | RNA-specific? |
| ------------------------ | ----------------- | --------------- | --------- | -------------- |
| **PWMEnrich** | ✅ | ✅ | R | ✅ | --> tried (see pipeline.v1-block3)!
| **RBPmap** | ✅ | ❌ (uses own db) | Web/CLI | ✅ | --> tried RBPmap, but it is too slow!
| **Biostrings/TFBSTools** | ❌ (only scanning) | ✅ | R | ❌ | #ATtRACT+Biostrings/TFBSTools (tried, pipeline.v1-block3)
| **rmap** | ✅ (CLIP-based) | ❌ | R | ✅ |
| **Homer** | ✅ | ✅ | CLI | ⚠ RNA optional |
| **MEME (AME, FIMO)** | ✅ | ✅ | Web/CLI | ⚠ Generic | --> Finally using ATtRACT+FIMO, AMD has BUG, not runnable!
#For me it was suggested to use “RBPmap” or “GraphProt” to do this analysis.
Get 3UTR.fasta, 5UTR.fasta, CDS.fasta and transcripts.fasta
mRNA Transcript
┌────────────┬────────────┬────────────┐
│ 5′ UTR │ CDS │ 3′ UTR │
└────────────┴────────────┴────────────┘
↑ ↑ ↑ ↑
Start Start Stop End
of Codon Codon of
Transcript Transcript
✅ Option 1: Use GENCODE and python scripts (CHOSEN!)
#Input: up- and down-, all-regulated files
~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-up.txt #20086
~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-down.txt #634
~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-up.txt #23832
~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-down.txt #375
#Filtering the down-regulated genes to include only protein_coding genes before extracting 3' UTRs, because
#1. Only protein_coding genes have well-annotated 3' UTRs
#3' UTRs are defined as the region after the CDS (coding sequence) and before the poly-A tail.
#Non-coding RNAs (e.g., lncRNA, snoRNA, miRNA precursors) do not have CDS, and therefore don't have canonical 3' UTRs.
#2. In GENCODE, most UTR annotations are only provided for transcripts of gene_type = "protein_coding".
cd ~/DATA/Data_Ute/RBPs_analysis/extract_3UTR_5UTR_CDS_transcript
grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-up.txt > MKL-1_wt.EV_vs_parental-up_protein_coding.txt
grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-down.txt > MKL-1_wt.EV_vs_parental-down_protein_coding.txt
grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-up.txt > WaGa_wt.EV_vs_parental-up_protein_coding.txt
grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-down.txt > WaGa_wt.EV_vs_parental-down_protein_coding.txt
#Visit and Download: GENCODE FTP site https://www.gencodegenes.org/human/
* GTF annotation file (e.g., gencode.v48.annotation.gtf.gz)
* Corresponding genome FASTA (e.g., GRCh38.primary_assembly.genome.fa.gz)
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_48/gencode.v48.annotation.gtf.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_48/GRCh38.primary_assembly.genome.fa.gz
gunzip gencode.v48.annotation.gtf.gz
gunzip GRCh38.primary_assembly.genome.fa.gz
python extract_transcript_parts.py MKL-1_wt.EV_vs_parental-down_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa MKL-1_down
python extract_transcript_parts.py MKL-1_wt.EV_vs_parental-up_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa MKL-1_up #5988
python extract_transcript_parts.py WaGa_wt.EV_vs_parental-down_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa WaGa_down #93
python extract_transcript_parts.py WaGa_wt.EV_vs_parental-up_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa WaGa_up #6538
✅ Option 2-5 see at the end!
Why 3' UTR?
🧬 miRNA, RBP, or translation/post-transcriptional regulation
➡️ Use 3' UTR sequences
Because:
Most miRNA binding and many RBP motifs are located in the 3' UTR.
It’s the primary region for mRNA stability, localization, and translation regulation.
🧠 Example: You're looking for binding enrichment of miRNAs or RNA-binding proteins (PUM, HuR, etc.)
✅ Input = 3UTR.fasta
🧪 If you're testing PBRs related to:
- Translation initiation, upstream ORFs, or 5' cap interaction:
➡️ Use 5' UTR
- Coding mutations, protein-level motifs, or translational efficiency:
➡️ Use CDS
- General transcriptome-wide motif search (no preference):
➡️ Use transcripts, or test all regions separately to localize signal
Recommended Workflow with RBPmap https://rbpmap.technion.ac.il (Too slow!)
RBPmap itself does not compute enrichment p-values or FDR; it's a motif scanning tool.
To get statistically meaningful RBP enrichments, combine RBPmap with custom permutation testing or Fisher’s exact test + multiple testing correction.
1. Prepare foreground (target) and background sequences
Extract 3′ UTRs of:
📉 Downregulated mRNAs (foreground) — likely targeted by upregulated miRNAs
⚪ A control set of 3′ UTRs — e.g., non-differentially expressed protein-coding genes
grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/MKL-1_wt.EV_vs_parental-all.txt > MKL-1_wt.EV_vs_parental-all_protein_coding.txt
grep ",\"protein_coding\"," ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/degenes/WaGa_wt.EV_vs_parental-all.txt > WaGa_wt.EV_vs_parental-all_protein_coding.txt
cut -d',' -f1 MKL-1_wt.EV_vs_parental-all_protein_coding.txt | sort > all_genes.txt #19239
cut -d',' -f1 MKL-1_wt.EV_vs_parental-up_protein_coding.txt | sort > up_genes.txt #5988
cut -d',' -f1 MKL-1_wt.EV_vs_parental-down_protein_coding.txt | sort > down_genes.txt #112
cat up_genes.txt down_genes.txt | sort | uniq > regulated_genes.txt
comm -23 all_genes.txt regulated_genes.txt > background_genes.txt
grep -Ff background_genes.txt MKL-1_wt.EV_vs_parental-all_protein_coding.txt > MKL-1_wt.EV_vs_parental-background_protein_coding.txt #13139
cut -d',' -f1 WaGa_wt.EV_vs_parental-all_protein_coding.txt | sort > all_genes.txt #19239
cut -d',' -f1 WaGa_wt.EV_vs_parental-up_protein_coding.txt | sort > up_genes.txt #6538
cut -d',' -f1 WaGa_wt.EV_vs_parental-down_protein_coding.txt | sort > down_genes.txt #93
cat up_genes.txt down_genes.txt | sort | uniq > regulated_genes.txt
comm -23 all_genes.txt regulated_genes.txt > background_genes.txt
grep -Ff background_genes.txt WaGa_wt.EV_vs_parental-all_protein_coding.txt > WaGa_wt.EV_vs_parental-background_protein_coding.txt #12608
python extract_transcript_parts.py MKL-1_wt.EV_vs_parental-background_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa MKL-1_background
python extract_transcript_parts.py WaGa_wt.EV_vs_parental-background_protein_coding.txt ~/REFs/gencode.v48.annotation.gtf ~/REFs/GRCh38.primary_assembly.genome.fa WaGa_background
foreground.fasta: 你的目标(前景)序列,例如下调基因的 3′UTRs。
background.fasta: 你的背景对照序列,例如未显著差异表达的基因的 3′UTRs。
2. Run RBPmap separately on both sets (in total of 6 calculations)
* Submit both sets of UTRs to RBPmap.
* Use the same settings (e.g., “human genome”, “high stringency”, "Apply conservation filter" etc.)
* Choose all RBPs
* Download motif match outputs for both sets
3. Count motif hits per RBP in each set
You now have:
For each RBP:
a: number of target 3′ UTRs with a motif match
b: number of background 3′ UTRs with a motif match
c: total number of target 3′ UTRs
d: total number of background 3′ UTRs
4. Perform Fisher’s Exact Test per RBP
For each RBP, construct a 2x2 table:
Motif Present Motif Absent
Foreground (targets) a c - a
Background b d - b
5. Adjust p-values for multiple testing
Use Benjamini-Hochberg (FDR) correction (e.g., in Python or R) across all RBPs tested.
6.✅ Summary
Step Tool
Prepare Database of RNA-binding motifs ATtRACT
3′ UTR extraction extract_transcript_parts.py
Motif scan RBPmap or FIMO
Count motif hits Your own parser (Python or R)
Fisher’s exact test scipy.stats or fisher.test()
FDR correction multipletests() or p.adjust()
python rbp_enrichment.py rbpmap_downregulated.tsv rbpmap_background.tsv rbp_enrichment_results.csv
Quick Drop-In Plan (RBPmap Alternative with FIMO for motif scan)
1. [ATtRACT + FIMO (MEME suite)]
ATtRACT: Database of RNA-binding motifs.
FIMO: Fast and scriptable motif scanning tool.
#Download RBP motifs (PWM) from ATtRACT DB; Convert to MEME format (if needed); Use FIMO to scan UTR sequences
grep "Homo_sapiens" ATtRACT_db.txt > attract_human.txt
#cut -f12 attract_human.txt | sort | uniq > valid_ids.txt
python filter_short_fasta.py ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_background.3UTR.fasta MKL-1_background.filtered.3UTR.fasta
✅ 筛选完成: 总序列 = 70650
🧹 已移除过短序列 (<16 nt): 1760
🟢 保留有效序列 (≥16 nt): 68890
📁 新背景文件保存为: MKL-1_background.filtered.3UTR.fasta
# 检查背景文件中有多少序列:
grep -c "^>" MKL-1_background.filtered.3UTR.fasta
68890
# 检查背景 FIMO 命中的总序列数:
cut -f3 fimo_background_MKL-1_background/fimo.tsv | sort | uniq | wc -l
67841
python filter_short_fasta.py ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_up.3UTR.fasta MKL-1_up.filtered.3UTR.fasta
python filter_short_fasta.py ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_down.3UTR.fasta MKL-1_down.filtered.3UTR.fasta
python filter_short_fasta.py ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_background.3UTR.fasta WaGa_background.filtered.3UTR.fasta
python filter_short_fasta.py ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_up.3UTR.fasta WaGa_up.filtered.3UTR.fasta
python filter_short_fasta.py ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_down.3UTR.fasta WaGa_down.filtered.3UTR.fasta
python convert_attract_pwm_to_meme.py
fimo --thresh 1e-4 --oc fimo_foreground_MKL-1_down attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_down.3UTR.fasta
fimo --thresh 1e-4 --oc fimo_foreground_MKL-1_up attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_up.3UTR.fasta
fimo --thresh 1e-4 --oc fimo_background_MKL-1_background attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/MKL-1_background.3UTR.fasta
fimo --thresh 1e-4 --oc fimo_foreground_WaGa_down attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_down.3UTR.fasta
fimo --thresh 1e-4 --oc fimo_foreground_WaGa_up attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_up.3UTR.fasta
fimo --thresh 1e-4 --oc fimo_background_WaGa_background attract_human.meme ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_background.3UTR.fasta
#Explanation for the table from FIMO (Find Individual Motif Occurrences), which scans sequences to find statistically significant matches to known motifs (e.g., RNA or DNA binding sites).
Column Meaning
motif_id ID of the motif, as defined in the .meme file
motif_alt_id Alternative ID or name for the motif (may be blank or unused)
sequence_name Name of the sequence where the motif was found (e.g., gene
start Start position (1-based) of the motif match within the sequence
stop End position of the motif match
strand Strand on which the motif was found: + (forward) or - (reverse)
score Motif match score; higher scores indicate better matches
p-value Statistical significance of the match (lower is more significant)
q-value Adjusted p-value (False Discovery Rate corrected)
matched_sequence The actual sequence in the input that matches the motif
✅ Example Interpretation
1338 ENSG00000134871|ENST00000714397|3UTR 103 114 + 23.4126 5.96e-08 0.111 GGAGAGAAGGGA
motif_id: 1338 — a numeric ID from your motif file
sequence_name: ENSG00000134871|ENST00000714397|3UTR — refers to the gene, transcript, and region (3′ UTR)
start–stop: 103–114 — the motif occurs from position 103 to 114
strand: + — found on the positive strand
score: 23.41 — high score means strong motif match
p-value: 5.96e-08 — very statistically significant
q-value: 0.111 — FDR-corrected p-value
matched_sequence: GGAGAGAAGGGA — the actual sequence match in the UTR
💡 Tips
You can map motif_id to RBP (RNA-binding protein) names using an annotation file like ATtRACT_db.txt.
Typically, q-value < 0.05 is considered significant.
Duplicate matches in different transcripts of the same gene may occur and are valid.
Would you like help converting motif_id to RBP names for clarity?
🧠 In most biological contexts:
* Counting a motif as present multiple times because it's in several transcripts can inflate significance.
* If you're using Fisher's exact test (as in enrichment), this transcript-level duplication can bias results.
⚠️ Caveat: If you're studying isoform-specific regulation, then transcript-level data may be valuable and shouldn't be collapsed. But for most general RBP enrichment or gene expression studies, the gene-level collapse is preferred.
#Keep only one match per gene (based on Ensembl Gene ID like ENSG00000134871) for each RBP motif, even if multiple transcripts have hits.
#python filter_fimo_best_per_gene.py --input fimo_foreground/fimo.tsv --output fimo_foreground/fimo.filtered.tsv
convert_gtf_to_Gene_annotation_TSV_file.py #generate gene_annotation.tsv
python filter_fimo_best_per_gene_annotated.py \
--input fimo_foreground_MKL-1_down/fimo.tsv \
--annot Homo_sapiens.GRCh38.gene_annotation.tsv \
--output_filtered fimo_foreground_MKL-1_down/fimo.filtered.tsv \
--output_annotated fimo_foreground_MKL-1_down/fimo.filtered.annotated.tsv
#21559
python filter_fimo_best_per_gene_annotated.py \
--input fimo_foreground_MKL-1_up/fimo.tsv \
--annot Homo_sapiens.GRCh38.gene_annotation.tsv \
--output_filtered fimo_foreground_MKL-1_up/fimo.filtered.tsv \
--output_annotated fimo_foreground_MKL-1_up/fimo.filtered.annotated.tsv
#(736661 rows)
python filter_fimo_best_per_gene_annotated.py \
--input fimo_background_MKL-1_background/fimo.tsv \
--annot Homo_sapiens.GRCh38.gene_annotation.tsv \
--output_filtered fimo_background_MKL-1_background/fimo.filtered.tsv \
--output_annotated fimo_background_MKL-1_background/fimo.filtered.annotated.tsv
#(1869075 rows)
python filter_fimo_best_per_gene_annotated.py \
--input fimo_foreground_WaGa_down/fimo.tsv \
--annot Homo_sapiens.GRCh38.gene_annotation.tsv \
--output_filtered fimo_foreground_WaGa_down/fimo.filtered.tsv \
--output_annotated fimo_foreground_WaGa_down/fimo.filtered.annotated.tsv
#(20364 rows)
python filter_fimo_best_per_gene_annotated.py \
--input fimo_foreground_WaGa_up/fimo.tsv \
--annot Homo_sapiens.GRCh38.gene_annotation.tsv \
--output_filtered fimo_foreground_WaGa_up/fimo.filtered.tsv \
--output_annotated fimo_foreground_WaGa_up/fimo.filtered.annotated.tsv
#(805634 rows)
python filter_fimo_best_per_gene_annotated.py \
--input fimo_background_WaGa_background/fimo.tsv \
--annot Homo_sapiens.GRCh38.gene_annotation.tsv \
--output_filtered fimo_background_WaGa_background/fimo.filtered.tsv \
--output_annotated fimo_background_WaGa_background/fimo.filtered.annotated.tsv
#(1811615 rows)
python run_enrichment.py \
--attract ATtRACT_db.txt \
--fimo_fg fimo_foreground_MKL-1_up/fimo.filtered.tsv \
--fimo_bg fimo_background_MKL-1_background/fimo.filtered.tsv \
--output rbp_enrichment_MKL-1_up.csv \
--strategy inclusive
python run_enrichment.py \
--attract ATtRACT_db.txt \
--fimo_fg fimo_foreground_MKL-1_down/fimo.filtered.tsv \
--fimo_bg fimo_background_MKL-1_background/fimo.filtered.tsv \
--output rbp_enrichment_MKL-1_down.csv
python run_enrichment.py \
--attract ATtRACT_db.txt \
--fimo_fg fimo_foreground_WaGa_up/fimo.filtered.tsv \
--fimo_bg fimo_background_WaGa_background/fimo.filtered.tsv \
--output rbp_enrichment_WaGa_up.csv
python run_enrichment.py \
--attract ATtRACT_db.txt \
--fimo_fg fimo_foreground_WaGa_down/fimo.filtered.tsv \
--fimo_bg fimo_background_WaGa_background/fimo.filtered.tsv \
--output rbp_enrichment_WaGa_down.csv
python plot_volcano.py --csv rbp_enrichment_MKL-1_up.csv --output MKL-1_volcano_up.pdf --title "Upregulated MKL-1"
python plot_rbp_heatmap.py \
--csvs rbp_enrichment_MKL-1_up.csv rbp_enrichment_MKL-1_down.csv \
--labels Upregulated Downregulated \
--output MKL-1_rbp_enrichment_heatmap.pdf
#Column Meaning
#a Number of unique foreground UTRs hit by the RBP
#b Number of unique background UTRs hit by the RBP
#c Total number of foreground UTRs
#d Total number of background UTRs (⬅️ this is the value you're asking about)
#p_value, fdr From Fisher's exact test on enrichment
#-- Get all genes the number 1621 refers to --
#AGO2,1621,5050,5732,12987,1.0,1.0 #MKL-1_up
#motif_ids are 414 and 399
grep "^414" fimo.filtered.annotated.tsv > AGO2.txt
grep "^399" fimo.filtered.annotated.tsv >> AGO2.txt
cut -d$'\t' -f11 AGO2.txt | sort -u > AGO2_uniq.txt
wc -l AGO2_uniq.txt
#1621 AGO2_uniq.txt
#工具 功能 关注点 应用场景
FIMO 精确查找 motif 出现位置 motif 在什么位置出现 找出具体结合位点
AME 统计 motif 富集情况 哪些 motif 在某组序列中更富集 比较 motif 是否显著出现更多
如你还在做差异表达后的RBP富集分析,可以考虑先用 FIMO 扫描,再用你自己写的代码 + Fisher’s exact test 做类似 AME 的工作,或直接用 AME 做分析
# Generate the attract_human.meme inkl. Gene_name!
#python generate_named_meme.py pwm.txt attract_human.txt
python generate_attract_human_meme.py pwm.txt ATtRACT_db.txt
#ERROR during running ame --> DEBUG!
#--control ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_all.3UTR.fasta \
ame --control --shuffle-- \
--oc ame_out \
--scoring avg \
--method fisher --verbose 5 ../Data_RNA-Seq_MKL-1+WaGa/motif_analysis/WaGa_down.3UTR.fasta attract_human.meme
2. GraphProt2 (ALTERNATIVE_TODO)
ML-based tool using sequence + structure
Pre-trained models for many RBPs
✅ Advantages:
Local, GPU/CPU supported
More biologically realistic (includes structure)
miRNAs motif analysis using ATtRACT + FIMO
✅ Goal
* Extract their sequences
* Generate a background set
* Run RBP enrichment (e.g., with RBPmap or FIMO)
* Get p-adjusted enrichment stats (e.g., Fisher + BH)
Input_1. DE results (differential expression file from smallRNA-seq)
#Input: up- and down-, all-regulated files
#~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/EV_vs_parental-up.txt #83
#~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/EV_vs_parental-down.txt #34
#~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/EV_vs_parental-all.txt #1304
~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/untreated_vs_parental_cells-up.txt #66
~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/untreated_vs_parental_cells-down.txt #38
~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/untreated_vs_parental_cells-all.txt #1304
#Format: 1st column = miRNA ID (e.g., hsa-miR-21-5p), optionally with other stats.
Input_2. Reference FASTA (Reference sequences from miRBase or GENCODE)
#From miRBase: https://mirbase.org/download/ https://mirbase.org/download/CURRENT/
##miRBase_v21
#mature.fa.gz → contains mature miRNA sequences
#hairpin.fa.gz → for pre-miRNAs
cp ~/DATA/Data_Ute/Data_Ute_smallRNA_7/summaries_exo7/miRNAs/untreated_vs_parental_cells-*.txt .
#"hsa-miR-3180|hsa-miR-3180-3p"
#>hsa-miR-3180 MIMAT0018178 Homo sapiens miR-3180
#UGGGGCGGAGCUUCCGGAG
#>hsa-miR-3180-3p MIMAT0015058 Homo sapiens miR-3180-3p
#UGGGGCGGAGCUUCCGGAGGCC
5.1 (Optional, not used!)
#python extract_miRNA_fasta.py EV_vs_parental-up.txt mature_v21.fa up_mature_miRNAs.fa --unmatched up_mature_unmatched.txt #84+0
#python extract_miRNA_fasta.py EV_vs_parental-up.txt hairpin_v21.fa up_precursor_miRNAs.fa --unmatched up_precursor_unmatched.txt #0
#python extract_miRNA_fasta.py EV_vs_parental-down.txt mature_v21.fa down_mature_miRNAs.fa --unmatched down_mature_unmatched.txt #34+0
#python extract_miRNA_fasta.py EV_vs_parental-down.txt hairpin_v21.fa down_precursor_miRNAs.fa --unmatched down_precursor_unmatched.txt #0
#python extract_miRNA_fasta.py EV_vs_parental-all.txt mature_v21.fa all_mature_miRNAs.fa --unmatched all_mature_unmatched.txt #1304+16
#python extract_miRNA_fasta.py EV_vs_parental-all.txt hairpin_v21.fa all_precursor_miRNAs.fa --unmatched all_precursor_unmatched.txt #16
python extract_miRNA_fasta.py untreated_vs_parental_cells-up.txt mature_v21.fa up_mature_miRNAs.fa --unmatched up_mature_unmatched.txt #67+0
python extract_miRNA_fasta.py untreated_vs_parental_cells-up.txt hairpin_v21.fa up_precursor_miRNAs.fa --unmatched up_precursor_unmatched.txt #0
python extract_miRNA_fasta.py untreated_vs_parental_cells-down.txt mature_v21.fa down_mature_miRNAs.fa --unmatched down_mature_unmatched.txt #38+0
python extract_miRNA_fasta.py untreated_vs_parental_cells-down.txt hairpin_v21.fa down_precursor_miRNAs.fa --unmatched down_precursor_unmatched.txt #0
python extract_miRNA_fasta.py untreated_vs_parental_cells-all.txt mature_v21.fa all_mature_miRNAs.fa --unmatched all_mature_unmatched.txt #1304+16
python extract_miRNA_fasta.py untreated_vs_parental_cells-all.txt hairpin_v21.fa all_precursor_miRNAs.fa --unmatched all_precursor_unmatched.txt #16
5.2 (Advanced)
Extract Sequences + Background Set
Inputs:
* up_miRNA.txt and down_miRNA.txt: DE results (first column = miRNA name, e.g., hsa-miR-21-5p)
* mature.fa or hairpin.fa from miRBase
Outputs:
* mirna_up.fa
* mirna_down.fa
* mirna_background.fa
#Use all remaining miRNAs as background:
python prepare_miRNA_sets.py untreated_vs_parental_cells-up.txt untreated_vs_parental_cells-down.txt mature_v21.fa mirna --full-background
mv mirna_background.fa mirna_full-background.fa
#Use random subset background. Note that the generated background has the number of maxsize(up, down), in the case is up (84 records):
python prepare_miRNA_sets.py untreated_vs_parental_cells-up.txt untreated_vs_parental_cells-down.txt mature_v21.fa mirna
# grep ">" mature_v21.fa | wc -l #35828
# grep ">" mirna_full-background.fa | wc -l #35710-->35723
# grep ">" mirna_up.fa | wc -l #84
# grep ">" mirna_down.fa | wc -l #34
# grep ">" mirna_background.fa | wc -l #84-->67
# #35,710 + 84 + 34 = 35,828
🔬 What You Can Do Next
Goal Tool Input
* RBP motif enrichment in pre-miRNAs RBPmap, FIMO, AME up_precursor_miRNAs.fa
* Motif comparison (up vs down miRNAs) DREME, MEME, HOMER Up/down mature miRNAs
* Build background for enrichment Random subset of other miRNAs Filtered from hairpin.fa
fimo --thresh 1e-4 --oc fimo_mirna_down attract_human.meme mirna_down.fa
fimo --thresh 1e-4 --oc fimo_mirna_up attract_human.meme mirna_up.fa
fimo --thresh 1e-4 --oc fimo_mirna_full-background attract_human.meme mirna_full-background.fa
fimo --thresh 1e-4 --oc fimo_mirna_background attract_human.meme mirna_background.fa
#END
python filter_fimo_best_per_gene_annotated.py \
--input fimo_mirna_down/fimo.tsv \
--annot Homo_sapiens.GRCh38.gene_annotation.tsv \
--output_filtered fimo_mirna_down/fimo.filtered.tsv \
--output_annotated fimo_mirna_down/fimo.filtered.annotated.tsv
python filter_fimo_best_per_gene_annotated.py \
--input fimo_mirna_up/fimo.tsv \
--annot Homo_sapiens.GRCh38.gene_annotation.tsv \
--output_filtered fimo_mirna_up/fimo.filtered.tsv \
--output_annotated fimo_mirna_up/fimo.filtered.annotated.tsv
python filter_fimo_best_per_gene_annotated.py \
--input fimo_mirna_full-background/fimo.tsv \
--annot Homo_sapiens.GRCh38.gene_annotation.tsv \
--output_filtered fimo_mirna_full-background/fimo.filtered.tsv \
--output_annotated fimo_mirna_full-background/fimo.filtered.annotated.tsv
python filter_fimo_best_per_gene_annotated.py \
--input fimo_mirna_background/fimo.tsv \
--annot Homo_sapiens.GRCh38.gene_annotation.tsv \
--output_filtered fimo_mirna_background/fimo.filtered.tsv \
--output_annotated fimo_mirna_background/fimo.filtered.annotated.tsv
python run_enrichment.py \
--attract ATtRACT_db.txt \
--fimo_fg fimo_mirna_up/fimo.filtered.tsv \
--fimo_bg fimo_mirna_full-background/fimo.filtered.tsv \
--output rbp_enrichment_mirna_up_on_full-background.csv \
--strategy inclusive
python run_enrichment.py \
--attract ATtRACT_db.txt \
--fimo_fg fimo_mirna_down/fimo.filtered.tsv \
--fimo_bg fimo_mirna_full-background/fimo.filtered.tsv \
--output rbp_enrichment_mirna_down_on_full-background.csv \
--strategy inclusive
python run_enrichment.py \
--attract ATtRACT_db.txt \
--fimo_fg fimo_mirna_up/fimo.filtered.tsv \
--fimo_bg fimo_mirna_background/fimo.filtered.tsv \
--output rbp_enrichment_mirna_up_on_subset-background.csv \
--strategy inclusive
python run_enrichment.py \
--attract ATtRACT_db.txt \
--fimo_fg fimo_mirna_down/fimo.filtered.tsv \
--fimo_bg fimo_mirna_background/fimo.filtered.tsv \
--output rbp_enrichment_mirna_down_on_subset-background.csv \
--strategy inclusive
✅ RBP Enrichment from RBPmap Results
🔹 Use RBPmap output (typically CSV or TSV)
🔹 Compare hit counts in input vs background
🔹 Perform Fisher's exact test + Benjamini-Hochberg correction
🔹 Plot significantly enriched RBPs
📁 Requirements
You’ll need:
File Description
rbpmap_up.tsv RBPmap result file for upregulated set
rbpmap_background.tsv RBPmap result file for background set
📝 These should have columns like:
Motif Name or Protein
Sequence Name or Sequence ID
(If different, I’ll show you how to adjust.
python analyze_rbpmap_enrichment.py rbpmap_up.tsv rbpmap_background.tsv enriched_up.csv enriched_up_plot.png
✅ Output
enriched_up.csv
RBP FG_hits BG_hits pval padj enriched
ELAVL1 24 2 0.0001 0.003 ✅
HNRNPA1 15 10 0.04 0.06 ❌
enriched_up_plot.png
Barplot showing top significant RBPs (lowest FDR)
🧰 Customization Options
Would you like:
* Support for multiple RBPmap files at once?
* To match by RBP family?
* A full report (PDF/HTML) of top hits?
* Let me know, and I’ll tailor the next script!
The planned pipeline is very reasonable and scientifically sound for RBP (RNA-binding protein) enrichment using miRNAs as input for FIMO motif scanning and downstream analysis.
Here’s a breakdown of why your steps make sense, plus a few important suggestions:
✅ YOUR PIPELINE LOGIC (EXCELLENT)
1. Motif Scanning with FIMO
fimo --thresh 1e-4 --oc fimo_mirna_* attract_human.meme mirna_*.fa
✔️ Uses ATtRACT motifs to scan miRNA sequences.
✔️ Applies a reasonable p-value threshold (1e-4) to detect motif hits.
📌 If you want more sensitivity (but more noise), you could try 1e-3; for higher stringency, 1e-5.
2. Filter FIMO hits to best per gene and annotate
python filter_fimo_best_per_gene_annotated.py --input ... --annot ... --output ...
✔️ Filters redundant hits.
✔️ Annotates with gene symbols or other metadata — essential for interpretability.
3. RBP Enrichment
python run_enrichment.py --fimo_fg ... --fimo_bg ... --attract ... --output ...
✔️ Uses foreground (up/down-regulated) and background (non-differential) miRNAs.
✔️ The --strategy inclusive is typical and includes all motif hits, not just strongest per gene or region.
✔️ Running up and down separately = ✅ best practice.
🔍 MINOR IMPROVEMENTS / SUGGESTIONS
✅ Use the same background for both up and down analyses:
You already are — good!
This ensures your enrichment comparison is balanced and interpretable.
⚠️ Check motif directionality (optional):
Some RBPs are strand-specific — but since miRNAs are single-stranded, and your FASTA is from mature sequences, you’re probably okay.
✅ Consider including summary stats:
How many motifs hit in up/down/background?
How many unique RBPs?
This helps interpret enrichment p-values.
🧪 Statistical validation (optional but helpful):
If you have many miRNAs (e.g. >30), consider bootstrapping enrichment from random subsets of the background to assess robustness.
✅ FINAL VERDICT
Your pipeline is:
✔️ Technically correct
✔️ Biologically appropriate
✔️ Reproducible and clear
🧠 Well-structured for miRNA-RBP motif enrichment
Why background differs between RNA-seq and miRNA motif analysis:
In RNA-seq differential expression analysis, especially for gene-level enrichment (e.g., GO or pathway analysis), the background typically includes all genes detected or expressed, not just a random subset. The logic is:
✅ “If a gene was testable (i.e., expressed above a threshold), it should be part of the universe we consider when evaluating enrichment.”
This is because the statistical model assumes:
Your universe is all genes tested.
You're asking: Are my differentially expressed genes overrepresented in some category, compared to all testable genes?
🔬 Why a random subset background was used in the miRNA motif pipeline (e.g., RBP enrichment):
In the prepare_miRNA_sets.py script, the subset was used for one reason only:
📉 To balance the number of background sequences relative to the foreground (up/down) — especially for statistical tests like Fisher's exact test.
But this is optional and not required for correctness. A larger background improves statistical power, but must match distributional properties (e.g., GC content, sequence length) to avoid bias.
✅ So: Should you use all non-differentially expressed miRNAs as background?
Yes, that is often more appropriate, assuming:
You have a full set of detectable/testable miRNAs (e.g., all from the FASTA).
You exclude the foreground (up/down) sets.
The background matches the general properties of the foreground set.
RBP enrichments via FIMO (Another description; it is the same to the workflow in the point 4)
1. Collect the 3′ UTR sequences: Use the 3UTR.fasta file generated earlier, filtered to protein-coding and downregulated genes.
2. Prepare Motif Database (MEME format)
* ATtRACT: https://attract.cnic.es
* RBPDB: http://rbpdb.ccbr.utoronto.ca
* Ray2013 (CISBP-RNA motifs) — available via MEME Suite
* [RBPmap motifs (if downloadable)]
#Example format: rbp_motifs.meme
2. Run FIMO to Scan for RBP Motifs (Similar to RBPmap)
fimo --oc fimo_up rbp_motifs.meme mirna_up.fa
fimo --oc fimo_down rbp_motifs.meme mirna_down.fa
fimo --oc fimo_background rbp_motifs.meme mirna_background.fa
#This produces fimo.tsv in each output folder.
3. Run RBP motif enrichment using MEME Suite using AME (Analysis of Motif Enrichment). Note that FIMO+run_enrichment.py=AME, however, directly using AME returns ERROR:
ame \
--control control_3UTRs.fasta \
--oc ame_out \
--scoring avg \
--method fisher \
3UTR.fasta \
rbp_motifs.meme
Where:
* 3UTR.fasta = your downregulated genes’ 3′ UTRs
* control_3UTRs.fasta = background UTRs (e.g., random protein-coding genes not downregulated)
* rbp_motifs.meme = motif file from RBPDB or Ray2013
4. Interpret Results: Output includes RBP motifs enriched in your downregulated mRNAs' 3′ UTRs.
You can then link enriched RBPs to known interactions with your upregulated miRNAs, or explore their regulatory roles.
5. ✅ Bonus: Predict Which mRNAs Are Targets of Your miRNAs
Use tools like: miRanda, TargetScan, miRDB
Then intersect predicted targets with your downregulated genes to identify likely functional interactions.
6. Summary
Goal Input Tool / Approach
RBP enrichment 3UTR.fasta of downregulated genes AME with RBP motifs
Background/control 3′ UTRs from non-differential or upregulated genes
Link miRNA to targets Use TargetScan / miRanda Intersect with down genes
7. Would you like:
* Ready-to-use RBP motif .meme file?
* Script to generate background sequences?
* Visualization options for the enrichment results?
Other options to get sequences of 3UTR, 5UTR, CDS and mRNA transcripts
✅ Option 2: Use Ensembl BioMart (web-based, no coding) --> Lasting too long!
Go to Ensembl BioMart https://www.ensembl.org/biomart/martview/7b826bcbd0cec79021977f8dc12a8f61
Select:
Database: Ensembl Genes
Dataset: Homo sapiens genes (GRCh38 or latest)
Click on “Filters” → expand Region or Gene to limit your selection (optional).
Click on “Attributes”:
Under Sequences, check:
Sequences
3' UTR sequences
Optionally add gene IDs, transcript IDs, etc.
Click “Results” to view/download the FASTA of 3' UTRs.
✅ Option 3: Use GENCODE (precompiled annotations) and gffread
Use a tool like gffread (from the Cufflinks or gffread package) to extract 3' UTRs:
#gffread gencode.v44.annotation.gtf -g GRCh38.primary_assembly.genome.fa -w all_utrs.fa -U
#gffread -w three_prime_utrs.fa -g GRCh38.fa -x cds.fa -y proteins.fa -U -F gencode.gtf
grep -P "\tthree_prime_utr\t" gencode.v48.annotation.gtf > three_prime_utrs.gtf
gtf2bed < three_prime_utrs.gtf > three_prime_utrs.bed
bedtools getfasta -fi GRCh38.primary_assembly.genome.fa -bed three_prime_utrs.bed -name -s > three_prime_utrs.fa
gffread gencode.v48.annotation.gtf -g GRCh38.primary_assembly.genome.fa -U -w all_with_utrs.fa
Add -U flag to extract UTRs, and filter post hoc for only 3' UTRs if needed.
✅ Option 4: Use Bioconductor in R (UCSC-ID, not suitable!)
# Install if not already installed
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("GenomicFeatures")
BiocManager::install("txdbmaker")
#sudo apt-get update
#sudo apt-get install libmariadb-dev
#(optional)sudo apt-get install libmysqlclient-dev
install.packages("RMariaDB")
# Load library
library(GenomicFeatures)
# Create TxDb object for human genome
txdb <- txdbmaker::makeTxDbFromUCSC(genome="hg38", tablename="refGene")
# Extract 3' UTRs by transcript
utr3 <- threeUTRsByTranscript(txdb, use.names=TRUE)
# View or export as needed
✅ Option 5: Extract 3′ UTRs Using UCSC Table Browser (GUI method)
🔗 Website:
UCSC Table Browser
🔹 Step-by-Step Instructions
1. Set the basic parameters:
Clade: Mammal
Genome: Human
Assembly: GRCh38/hg38
Group: Genes and Gene Predictions
Track: GENCODE v44 (or latest)
Table: knownGene or wgEncodeGencodeBasicV44
Choose knownGene for RefSeq-like models or wgEncodeGencodeBasicV44 for GENCODE
2. Region:
Select: genome (default)
3. Output format:
Select: sequence
4. Click "get output"
🔹 Sequence Retrieval Options:
On the next page (after clicking "get output"), you’ll see sequence options.
Configure as follows:
✅ Output format: FASTA
✅ Which part of the gene: Select only
→ UTRs → 3' UTR only
✅ Header options: choose if you want gene name,
⚡️ Bonus: Combine with miRNA-mRNA predictions
Once you have RBPs enriched in downregulated mRNAs, you can intersect:
* Which RBPs overlap miRNA binding regions (e.g., via CLIPdb or POSTAR)
* Check if miRNAs and RBPs compete or co-bind
This can lead to identifying miRNA-RBP regulatory modules.
点赞本文的读者
还没有人对此文章表态
没有评论
Analysis of the RNA binding protein (RBP) motifs for RNA-Seq and miRNAs (v2)
Analysis of the RNA binding protein motifs for RNA Seq
© 2023 XGenes.com Impressum