BioPerl_Tutorial(1).pdf
(
196 KB
)
Pobierz
B
IO
P
ERL
T
UTORIAL
(
ABREV
.)
1) Simple script to get a sequence by Id and write to specified format
use Bio::Perl;
# this script will only work if you have an internet connection on the
# computer you're using, the databases you can get sequences from
# are 'swiss', 'genbank', 'genpept', 'embl', and 'refseq'
$seq_object = get_sequence('genbank',"ROA1_HUMAN");
write_sequence(">roa1.fasta",'fasta',$seq_object);
Getting started in Bio::Perl
That second argument of
write_sequence
,
'fasta',
is the sequence format. You
can choose among all the formats supported by SeqIO. Try writing the sequence file
in
'
genbank
'
format.
2) Another example is the ability to blast a sequence using the facilities as NCBI. Please
be careful not to abuse the resources that NCBI provides and use this only for
individual searches. If you want to do a large number of BLAST searches, please
download the blast package and install it locally.
use Bio::Perl;
$seq = get_sequence('genbank',"ROA1_HUMAN");
# uses the default database - nr in this case
$blast_result = blast_sequence($seq);
write_blast(">roa1.blast",$blast_result);
Bio::Perl has a limited number of functions to retrieve and manipulate sequence
data. (Bio::Perl manpage)
Sequence objects
(Seq, PrimarySeq, RichSeq, LargeSeq, LocatableSeq, RelSegment, LiveSeq,
SeqWithQuality, SeqI)
This section describes various Bioperl sequence objects. Typically you don’t need to
know the type of Sequence object because SeqIO assesses and creates the right type
of object when given a file, filehandle or string.
Seq
The central sequence object in bioperl. When in doubt this is probably the object
that you want to use to describe a DNA, RNA or protein sequence in bioperl. Most
common sequence manipulations can be performed with Seq. (Bio::Seq manpage).
Seq objects may be created for you automatically when you read in a file containing
sequence data using the SeqIO object (see below). In addition to storing its
identification labels and the sequence itself, a Seq object can store multiple
annotations and associated ``sequence features'', such as those contained in most
Genbank and EMBL sequence files. This capability can be very useful -‐ especially in
development of automated genome annotation systems.
PrimarySeq
A stripped-‐down version of Seq. It contains just the sequence data itself and a few
identifying labels (id, accession number, alphabet = dna, rna, or protein), and no
features. For applications with hundreds or thousands or sequences, using
PrimarySeq objects can significantly speed up program execution and decrease the
amount of RAM the program requires. (Bio::PrimarySeq manpage).
RichSeq
Stores additional annotations beyond those used by standard Seq objects. If you are
using sources with very rich sequence annotation, you may want to consider using
these objects. RichSeq objects are created automatically when Genbank, EMBL, or
Swissprot format files are read by SeqIO. (Bio::Seq::RichSeqI manpage)
LargeSeq
Used for handling very long sequences (e.g. > 100 MB). (Bio::Seq::LargeSeq
manpage).
LocatableSeq
Might be more appropriately called an ``AlignedSeq'' object. It is a Seq object which
is part of a multiple sequence alignment. It has start and end positions indicating
from where in a larger sequence it may have been extracted. It also may have gap
symbols corresponding to the alignment to which it belongs. It is used by the
alignment object SimpleAlign and other modules that use SimpleAlign objects (e.g.
AlignIO.pm, pSW.pm).
LocatableSeq objects will be made for you automatically when you create an
alignment (using pSW, Clustalw, Tcoffee, Lagan, or bl2seq) or when you input an
alignment data file using AlignIO. However if you need to input a sequence
alignment by hand (e.g. to build a SimpleAlign object), you will need to input the
sequences as LocatableSeqs. Other sources of information include the
Bio::LocatableSeq manpage, the Bio::SimpleAlign manpage, the Bio::AlignIO
manpage, and the Bio::Tools::pSW manpage.
SeqWithQuality
Used to manipulate sequences with quality data, like those produced by phred., , and
in (Bio::Seq::SeqWithQuality manpage).
RelSegment
Useful when you want to be able to manipulate the origin of the genomic coordinate
system. This situation may occur when looking at a sub-‐sequence (e.g. an exon)
which is located on a longer underlying underlying sequence such as a chromosome
or a contig. Such manipulations may be important, for example when designing a
graphical genome browser. If your code may need such a capability, look at the
documentation the Bio::DB::GFF::RelSegment manpage which describes this feature
in detail.
LiveSeq
Addresses the problem of features whose location on a sequence changes over time.
This can happen, for example, when sequence feature objects are used to store gene
locations on newly sequenced genomes -‐ locations which can change as higher
quality sequencing data becomes available. Although a LiveSeq object is not
implemented in the same way as a Seq object, LiveSeq does implement the SeqI
interface (see below). Consequently, most methods available for Seq objects will
work fine with LiveSeq objects. Section III.7.4 and the Bio::LiveSeq manpage contain
further discussion of LiveSeq objects.
SeqI
Seq ``interface objects''. They are used to ensure bioperl's compatibility with other
software packages. SeqI and other interface objects are not likely to be relevant to
the casual Bioperl user. (Bio::SeqI manpage)
Accessing sequence data from local and remote databases
$seq = Bio::Seq->new(-seq
-description
-display_id
-accession_number
-alphabet
=>
=>
=>
=>
=>
Example: create a simple Seq object. Can you make it print the accession number,
alphabet, and sequence each on a separate line
'actgtggcgtcaact',
'Sample Bio::Seq object',
'something',
'BIOL_5310',
'dna' );
In most cases, you will probably be accessing sequence data from some online data
file or database.
Accessing remote databases
Example: The following code example will get 3 different sequences from GenBank.
$gb = new Bio::DB::GenBank();
# this returns a Seq object :
$seq1 = $gb->get_Seq_by_id('MUSIGHBA1');
# this also returns a Seq object :
$seq2 = $gb->get_Seq_by_acc('AF303112');
# this returns a SeqIO object, which can be used to get a Seq object :
$seqio = $gb->get_Stream_by_id(["J00522","AF303112","2981014"]);
$seq3 = $seqio->next_seq;
Can you make it print all the sequence descriptions?
Transforming sequence files (SeqIO)
A common -‐ and tedious -‐ bioinformatics task is that of converting sequence data
among the many widely used data formats. Bioperl's SeqIO object, however, makes
this chore a breeze. SeqIO can read a stream of sequences -‐ located in a single or in
multiple files -‐ in a number of formats including Fasta, EMBL, GenBank, Swissprot,
SCF, phd/phred, Ace, fastq, exp, chado, or raw (plain sequence). SeqIO can also parse
tracefiles in alf, ztr, abi, ctf, and ctr format Once the sequence data has been read in
with SeqIO, it is available to Bioperl in the form of Seq, PrimarySeq, or RichSeq
objects, depending on what the sequence source is. Moreover, the sequence objects
can then be written to another file using SeqIO in any of the supported data formats
making data converters simple to implement, for example:
use Bio::SeqIO;
$in = Bio::SeqIO->new(-file => "inputfilename",
-format => 'Fasta');
$out = Bio::SeqIO->new(-file => ">outputfilename",
-format => 'EMBL');
while ( my $seq = $in->next_seq() ) {$out->write_seq($seq); }
In addition, the Perl ``tied filehandle'' syntax is available to SeqIO, allowing you to
use the standard <> and print operations to read and write sequence objects, eg:
$in
= Bio::SeqIO->newFh(-file => "inputfilename" ,
-format => 'fasta');
$out = Bio::SeqIO->newFh(-format => 'embl');
print $out $_ while <$in>;
If the ``-‐format'' argument isn't used then Bioperl will try to determine the format based
on the file's suffix, in a case-‐insensitive manner. If there's no suffix available then SeqIO
will attempt to guess the format based on actual content. If it can't determine the format
then it will assume ``fasta''. A complete list of formats and suffixes can be found in the
SeqIO HOWTO (http://www.bioperl.org/wiki/HOWTO:SeqIO).
Practice: Get the sequences AJ312413, NP_001073624, XM_001807534 with accession
number from genbank and write them to a file in fasta format.
Transforming alignment files (AlignIO)
Data files storing multiple sequence alignments also appear in varied formats. AlignIO is
the Bioperl object for conversion of alignment files. AlignIO is patterned on the SeqIO
object and its commands have many of the same names as the commands in SeqIO. Just as
in SeqIO the AlignIO object can be created with ``-‐file'' and ``-‐format'' options:
use Bio::AlignIO;
my $io = Bio::AlignIO->new(-file
=> "example.aln",
-format => "clustalw" );
If the ``-‐format'' argument isn't used then Bioperl will try and determine the format based
on the file's suffix, in a case-‐insensitive manner. Here is the current set of suffixes:
Format
bl2seq
clustalw
emboss*
fasta
maf
mase
mega
meme
metafasta
msf
nexus
pfam
phylip
po
prodom
psi
selex
stockholm
Suffixes
aln
water|needle
fasta|fast|seq|fa|fsa|nt|aa
maf
Seaview
meg|mega
meme
msf|pileup|gcg
nexus|nex
pfam|pfm
phylip|phlp|phyl|phy|phy|ph
psi
selex|slx|selx|slex|sx
GCG
interleaved
POA
PSI-BLAST
HMMER
Comment
*water, needle, matcher, stretcher, merger, and supermatcher.
Unlike SeqIO, AlignIO cannot create output files in every format. AlignIO currently
supports output in these 7 formats:
fasta, mase, selex, clustalw, msf/gcg, phylip
(interleaved), and po
.
Another significant difference between AlignIO and SeqIO is that AlignIO handles IO for
only a single alignment at a time but SeqIO.pm handles IO for multiple sequences in a
single stream. Syntax for AlignIO is almost identical to that of SeqIO:
use Bio::AlignIO;
$in = Bio::AlignIO->new(-file => "inputfilename" ,
-format => 'clustalw');
$out = Bio::AlignIO->new(-file => ">outputfilename",
-format => 'fasta');
while ( my $aln = $in->next_aln() ) { $out->write_aln($aln); }
The only difference is that the returned object reference, $aln, is to a SimpleAlign object
rather than to a Seq object.
Plik z chomika:
xyzgeo
Inne pliki z tego folderu:
oreilly_learning_perl(1).pdf
(4640 KB)
Perl(1).rar
(86 KB)
perl_wprowadzenie(2).pdf
(440 KB)
perl_tutorial(1).pdf
(2094 KB)
perltut(1).pdf
(2032 KB)
Inne foldery tego chomika:
0
algorytmika
artykuly
bioinformatyka (biotech06)
Bioinformatyka (patryska89)
Zgłoś jeśli
naruszono regulamin