Tutorial
RNAmediator_plfold
Using Genes
ConstraintPLFold.py can be used for folding a list of sequences with constraints. Therefore it is recommended to use one fasta file containing sequences and a corresponding bed file containing the constraints. Matching between sequences and constraints is done via the gene identifier (here ENSG00000270742) which has to be the same in the bed file (Column 4)and the fasta header and should look as follows:
FASTA
>ENSG00000270742:chr1:61124732-61125202(+)
TTTTTTCTTTATAATTATTCCCCTATTTGAAAAATCAACTTGTATATGAGGCAGCAAACACCTTGCAGAGC...
BED
chr1 26 35 ENSG00000270742 . +
A FASTA File with header following this format can easily be generated using getfasta from the BedTools suite on a gene coordinates BED file with options -name and -full-header enabled.
The command line call for this simple scenario is supposed to look like this, where window, span and ulim are options for RNAplfold and will be used to collect and summarize results in the the RNAmediator_collect_plfold step:
RNAmediator_plfold -s FASTA -x BedFile --window SizeOfWindowToFold --span BasepairSpan -u ulim
By default all results are printed to stdout. The first output is the folded sequence without constraints. This output can be redirected to a file using --unconstraint nameforunconstraint. The following outputs are the matrices for an unpaired constraint and for an paired constraint. These matrices can also be redirected to a file using --paired nameforpaired or --unpaired nameforunpaired. The folder to which these files should be saved is determined via the --outdir flag.
Constraints can also be passed using the --ono (one on one) flag like this:
python ConstraintPLFold.py -s FASTA -x ono,BedFile
Matching between constraints and sequences consequently ignores the gene identifiers and instead the first sequence will use the first line in the bed file as a constraint.
If your constraints are specified using genomic coordinates, it is essential to provide another bed file with genomic coordinates.
This file can be passed using the --genes GeneBedFile and should look like:
GeneBED
chr1 61124732 61125202 ENSG00000270742 . +
Folding without constraint
In case you simply want to fold sequences without applying any constraints you can use the scanning mode of RNAmediator_plfold which will only create --unconstraint nameforunconstraint ouput as follows:
RNAmediator_plfold -s FASTA -x scanning --window SizeOfWindowToFold --span BasepairSpan -u ulim
Folding with sliding window constraint
In case you want to fold sequences without specific constraint positions you can use the sliding mode of RNAmediator_plfold (DEFAULT) which will generate a list of constraints covering every nucleotide of the input sequence and fold each independently. This will take long and produce large amounts of output, so make sure this is really what you want. It does, however, provide one with the opportunity to scan for regions with largest impact on structure along a given sequence, e.g. to identify potential miRNA or antagonist binding sites:
RNAmediator_plfold -s FASTA -x sliding --window SizeOfWindowToFold --span BasepairSpan -u ulim
Using Transcripts or single gene features
If you plan to use transcript isoforms instead of genes it is highly recommended to change the header of your fasta sequences as well as providing local instead of genomic coordinates for constraints in the bed file.
If using single exon or intron sequences, make sure they have unique identifiers (e.g. ENST000000001.5_intron1, ENST000000001.5_intron2 etc.) in both files, using just the transcript ID is not enough to map constraints to the respective sequences in that case.
For transcripts this can look as follows:
FASTA
>ENST00000240304.5::ENST00000240304.5:0-5482(.)
GUCUUGUCGGCUCCUGUGUGUAGGAGGGAUUUCGGCCUGAGAGCGGG...
BED
ENST00000240304.5 1178 1183 ENST00000240304.5 . .
In this case you should ALWAYS provide a ‘local’ GeneBed file (--genes) which looks quite similar to the Constraints File but genes start at position 0 and end at the length of the sequence. It is not important to use this file in the Rissmed_plfold call.
However, it is essential for the RNAmediator_collect_plfold step to generate valid genomic coordinate BED files as output.
GeneBED
ENST00000240304.5 0 5482 ENST00000240304.5 . .
RNAmediator_collect_plfold
The methods mentioned in the RNAmediator_plfold example will produce output that can be processed by CollectConsResults.py. This will generate BED files storing the probability of being unpaired for spans of nucleotides around (not overlapping) the constraint. Make sure to provide a comma separated pattern for window and span as used in the RNAmediator_plfold call (here SizeOfWindowToFold,BasepairSpan) and ulim either similar or lower than in the RNAmediator_plfold call. Therefore simply call:
Rissmed_collect_plfold -d path/to/ConstraintPLFold/output -g GeneBed --outdir path/to/outdir --pattern SizeOfWindowToFold,BasepairSpan -u ulim
Note
The collection step will only work if you provide coordinates of the gene or transcript via a GeneBED file.
The -u parameter defines the span sizes that are used for the output BED files which might
for example look like this for -u 5:
chr1 110135760 110135765 ENSG00000065135|44542-44544|110135774-110135777 0.02620831 + 9 0.05007846 0.07628677 2.2444807817789587 0.02620831000000001 0.04615814895706037 0.15227569
chr1 110135761 110135766 ENSG00000065135|44542-44544|110135774-110135777 0.019514269999999993 + 8 0.049938 0.06945227 2.426255722126518 0.019514269999999993 -0.03666320494171017 0.15227569
chr1 110135762 110135767 ENSG00000065135|44542-44544|110135774-110135777 0.09578122 + 7 0.13804191 0.23382313 1.4457214540425338 0.09578122 0.9069421599395611 0.15227569
chr1 110135763 110135768 ENSG00000065135|44542-44544|110135774-110135777 0.07929997 + 6 0.06154513 0.1408451 1.5621026141257632 0.07929996999999998 0.7030295094408919 0.15227569
chr1 110135778 110135783 ENSG00000065135|44542-44544|110135774-110135777 0.31023007 + -7 0.14063798 0.45086805 0.7213795423617754 0.31023007 3.560189541047878 0.15227569
chr1 110135779 110135784 ENSG00000065135|44542-44544|110135774-110135777 0.2991335 + -8 0.1657577 0.4648912 0.7438289320089759 0.2991335 3.4228983161623856 0.15227569
chr1 110135780 110135785 ENSG00000065135|44542-44544|110135774-110135777 0.29541885 + -9 0.15833754 0.45375639 0.7515304743397738 0.29541885 3.376939173064934 0.15227569
Note
If you used different plfold parameters (-w, -l) in the RNAmediator_plfold call, you have to adapt the pattern accordingly
Output
RNAmediator_plfold
Per default, RNAmediator_plfold prints to STDOUT, or dumps numpy arrays to disk following the naming provided by the user with the --unconstraint nameforunconstraint, --paired nameforpaired and --unpaired nameforunpaired options. These numpy arrays are then used as input for RNAmediator_collect_plfold.
If the user prefers to also generate human readable files similar to RNAplfold output, the option –save 1 has to be set. This will generate ‘.gz’ files providing the same output as a commandline call to RNAplfold.
The standard user will not want to work on this output directly but generate genomic coordinate BED files, summarizing the effect of ligand binding as will be explained next.
RNAmediator_collect_plfold
The genomic coordinate BED file generated by RNAmediator_collect_plfold contains the following columns:
Chr Start End Constraint Accessibility_difference Strand Distance_to_constraint Accessibility_no_constraint Accessibility_constraint Energy_Difference Kd_change Zscore Accessibility_constraint_pos
Note
The first 6 fields follow standard BED format, with the Difference in Accessibility in Column 5 and the constraint as well as its local and genomic coordinates as Identifier in Column4. Distance_to_constraint shows the distance between the region that changed and the applied constraint Accessibility_no_constraint is the accessibility of this region before the constraint was applied Accessibility_constraint is the accessibility of this region after the constraint was applied Energy_Difference is the change in ‘Free Energy’ of the structure after applying the constraint Kd_change is the influence on the Kd of a potential binding partner after the constraint has been applied Zscore is the Zscore of accessibility changes in this position after the constraint has been applied in comparison to all changes in all positions Accessibility_constraint_pos shows the accessibility at the position that was constraint before the constraint was applied
Further steps
The BED file created by CollectConsResults can be used to intersect with other known binding sites on the same gene/transcript. Thus, it is possible to see whether the changes in RNA structure upon binding of one ligand might affect the structure of binding site of other ligands.
RNAtweaks
RNAplfold
For RNAplfold usage two different wrappers exist. One uses the command line version of RNAplfold and the other uses the ViennaRNA API
from RNAmediator.RNAtweaks import RNAtweaks
sequence = "AAATTTTGGGGGGCCCC"
window = 3 # winsize option of RNAplfold
span = 3 # span option of RNAplfold
region = 3 # ulength option of RNAplfold
constraint = ("paired", 3, 5)
api_result = RNAtweaks.api_rnaplfold(sequence, window, span, region=region, temperature=37.0, constraint=[constraint])
cmd_result = RNAtweaks.cmd_rnaplfold(sequence, window, span, region=region, temperature=37.0, constraint=[constraint])
For now only paired and unpaired constraints are supported. The constraints must be a list of Tuples in the format ("paired"/"unpaired", start, end) in contrast to the ViennaRNA constraints these are zero based.
Both calls will produce an identical PLFoldOutput object that reflects the ViennaRNA _lunp file.
PLFoldOutput
Object that reflects the ViennaRNA _lunp file. The objects supports various functions to get different representations of the file. The recommended usage is to produce an numpy array as follows:
array = api_result.numpy_array
However, it is also possible to get the text representation of the file, which is usually produced by RNAplfold via:
array = api_result.get_text(nan="NA", truncated=False)
Hereby nan replaced the non float values with "NA" and the truncated flag is used to either keep or drop the header lines beginning with “#”.