BioPerl_Tutorial(1).pdf

(196 KB) Pobierz

ERL

UTORIAL

(

ABREV

1) Simple script to get a sequence by Id and write to specified format

use Bio::Perl;

# this script will only work if you have an internet connection on the

# computer you're using, the databases you can get sequences from

# are 'swiss', 'genbank', 'genpept', 'embl', and 'refseq'

$seq_object = get_sequence('genbank',"ROA1_HUMAN");

write_sequence(">roa1.fasta",'fasta',$seq_object);

Getting started in Bio::Perl

That second argument of

write_sequence

'fasta',

is the sequence format. You

can choose among all the formats supported by SeqIO. Try writing the sequence file

genbank

format.

2) Another example is the ability to blast a sequence using the facilities as NCBI. Please

be careful not to abuse the resources that NCBI provides and use this only for

individual searches. If you want to do a large number of BLAST searches, please

download the blast package and install it locally.

use Bio::Perl;

$seq = get_sequence('genbank',"ROA1_HUMAN");

# uses the default database - nr in this case

$blast_result = blast_sequence($seq);

write_blast(">roa1.blast",$blast_result);

Bio::Perl has a limited number of functions to retrieve and manipulate sequence

data. (Bio::Perl manpage)

Sequence objects

(Seq, PrimarySeq, RichSeq, LargeSeq, LocatableSeq, RelSegment, LiveSeq,

SeqWithQuality, SeqI)

This section describes various Bioperl sequence objects. Typically you don’t need to

know the type of Sequence object because SeqIO assesses and creates the right type

of object when given a file, filehandle or string.

Seq

The central sequence object in bioperl. When in doubt this is probably the object

that you want to use to describe a DNA, RNA or protein sequence in bioperl. Most

common sequence manipulations can be performed with Seq. (Bio::Seq manpage).

Seq objects may be created for you automatically when you read in a file containing

sequence data using the SeqIO object (see below). In addition to storing its

identification labels and the sequence itself, a Seq object can store multiple

annotations and associated ``sequence features'', such as those contained in most

Genbank and EMBL sequence files. This capability can be very useful -‐ especially in

development of automated genome annotation systems.

PrimarySeq

A stripped-‐down version of Seq. It contains just the sequence data itself and a few

identifying labels (id, accession number, alphabet = dna, rna, or protein), and no

features. For applications with hundreds or thousands or sequences, using

PrimarySeq objects can significantly speed up program execution and decrease the

amount of RAM the program requires. (Bio::PrimarySeq manpage).

RichSeq

Stores additional annotations beyond those used by standard Seq objects. If you are

using sources with very rich sequence annotation, you may want to consider using

these objects. RichSeq objects are created automatically when Genbank, EMBL, or

Swissprot format files are read by SeqIO. (Bio::Seq::RichSeqI manpage)

LargeSeq

Used for handling very long sequences (e.g. > 100 MB). (Bio::Seq::LargeSeq

manpage).

LocatableSeq

Might be more appropriately called an ``AlignedSeq'' object. It is a Seq object which

is part of a multiple sequence alignment. It has start and end positions indicating

from where in a larger sequence it may have been extracted. It also may have gap

symbols corresponding to the alignment to which it belongs. It is used by the

alignment object SimpleAlign and other modules that use SimpleAlign objects (e.g.

AlignIO.pm, pSW.pm).

LocatableSeq objects will be made for you automatically when you create an

alignment (using pSW, Clustalw, Tcoffee, Lagan, or bl2seq) or when you input an

alignment data file using AlignIO. However if you need to input a sequence

alignment by hand (e.g. to build a SimpleAlign object), you will need to input the

sequences as LocatableSeqs. Other sources of information include the

Bio::LocatableSeq manpage, the Bio::SimpleAlign manpage, the Bio::AlignIO

manpage, and the Bio::Tools::pSW manpage.

SeqWithQuality

Used to manipulate sequences with quality data, like those produced by phred., , and

in (Bio::Seq::SeqWithQuality manpage).

RelSegment

Useful when you want to be able to manipulate the origin of the genomic coordinate

system. This situation may occur when looking at a sub-‐sequence (e.g. an exon)

which is located on a longer underlying underlying sequence such as a chromosome

or a contig. Such manipulations may be important, for example when designing a

graphical genome browser. If your code may need such a capability, look at the

documentation the Bio::DB::GFF::RelSegment manpage which describes this feature

in detail.

LiveSeq

Addresses the problem of features whose location on a sequence changes over time.

This can happen, for example, when sequence feature objects are used to store gene

locations on newly sequenced genomes -‐ locations which can change as higher

quality sequencing data becomes available. Although a LiveSeq object is not

implemented in the same way as a Seq object, LiveSeq does implement the SeqI

interface (see below). Consequently, most methods available for Seq objects will

work fine with LiveSeq objects. Section III.7.4 and the Bio::LiveSeq manpage contain

further discussion of LiveSeq objects.

SeqI

Seq ``interface objects''. They are used to ensure bioperl's compatibility with other

software packages. SeqI and other interface objects are not likely to be relevant to

the casual Bioperl user. (Bio::SeqI manpage)

Accessing sequence data from local and remote databases

$seq = Bio::Seq->new(-seq

-description

-display_id

-accession_number

-alphabet

Example: create a simple Seq object. Can you make it print the accession number,

alphabet, and sequence each on a separate line

'actgtggcgtcaact',

'Sample Bio::Seq object',

'something',

'BIOL_5310',

'dna' );

In most cases, you will probably be accessing sequence data from some online data

file or database.

Accessing remote databases

Example: The following code example will get 3 different sequences from GenBank.

$gb = new Bio::DB::GenBank();

# this returns a Seq object :

$seq1 = $gb->get_Seq_by_id('MUSIGHBA1');

# this also returns a Seq object :

$seq2 = $gb->get_Seq_by_acc('AF303112');

# this returns a SeqIO object, which can be used to get a Seq object :

$seqio = $gb->get_Stream_by_id(["J00522","AF303112","2981014"]);

$seq3 = $seqio->next_seq;

Can you make it print all the sequence descriptions?

Transforming sequence files (SeqIO)

A common -‐ and tedious -‐ bioinformatics task is that of converting sequence data

among the many widely used data formats. Bioperl's SeqIO object, however, makes

this chore a breeze. SeqIO can read a stream of sequences -‐ located in a single or in

multiple files -‐ in a number of formats including Fasta, EMBL, GenBank, Swissprot,

SCF, phd/phred, Ace, fastq, exp, chado, or raw (plain sequence). SeqIO can also parse

tracefiles in alf, ztr, abi, ctf, and ctr format Once the sequence data has been read in

with SeqIO, it is available to Bioperl in the form of Seq, PrimarySeq, or RichSeq

objects, depending on what the sequence source is. Moreover, the sequence objects

can then be written to another file using SeqIO in any of the supported data formats

making data converters simple to implement, for example:

use Bio::SeqIO;

$in = Bio::SeqIO->new(-file => "inputfilename",

-format => 'Fasta');

$out = Bio::SeqIO->new(-file => ">outputfilename",

-format => 'EMBL');

while ( my $seq = $in->next_seq() ) {$out->write_seq($seq); }

In addition, the Perl ``tied filehandle'' syntax is available to SeqIO, allowing you to

use the standard <> and print operations to read and write sequence objects, eg:

$in

= Bio::SeqIO->newFh(-file => "inputfilename" ,

-format => 'fasta');

$out = Bio::SeqIO->newFh(-format => 'embl');

print $out $_ while <$in>;

If the ``-‐format'' argument isn't used then Bioperl will try to determine the format based

on the file's suffix, in a case-‐insensitive manner. If there's no suffix available then SeqIO

will attempt to guess the format based on actual content. If it can't determine the format

then it will assume ``fasta''. A complete list of formats and suffixes can be found in the

SeqIO HOWTO (http://www.bioperl.org/wiki/HOWTO:SeqIO).

Practice: Get the sequences AJ312413, NP_001073624, XM_001807534 with accession

number from genbank and write them to a file in fasta format.

Transforming alignment files (AlignIO)

Data files storing multiple sequence alignments also appear in varied formats. AlignIO is

the Bioperl object for conversion of alignment files. AlignIO is patterned on the SeqIO

object and its commands have many of the same names as the commands in SeqIO. Just as

in SeqIO the AlignIO object can be created with ``-‐file'' and ``-‐format'' options:

use Bio::AlignIO;

my $io = Bio::AlignIO->new(-file

=> "example.aln",

-format => "clustalw" );

If the ``-‐format'' argument isn't used then Bioperl will try and determine the format based

on the file's suffix, in a case-‐insensitive manner. Here is the current set of suffixes:

Format

bl2seq

clustalw

emboss*

fasta

maf

mase

mega

meme

metafasta

msf

nexus

pfam

phylip

prodom

psi

selex

stockholm

Suffixes

aln

water|needle

fasta|fast|seq|fa|fsa|nt|aa

maf

Seaview

meg|mega

meme

msf|pileup|gcg

nexus|nex

pfam|pfm

phylip|phlp|phyl|phy|phy|ph

psi

selex|slx|selx|slex|sx

GCG

interleaved

POA

PSI-BLAST

HMMER

Comment

*water, needle, matcher, stretcher, merger, and supermatcher.

Unlike SeqIO, AlignIO cannot create output files in every format. AlignIO currently

supports output in these 7 formats:

fasta, mase, selex, clustalw, msf/gcg, phylip

(interleaved), and po

Another significant difference between AlignIO and SeqIO is that AlignIO handles IO for

only a single alignment at a time but SeqIO.pm handles IO for multiple sequences in a

single stream. Syntax for AlignIO is almost identical to that of SeqIO:

use Bio::AlignIO;

$in = Bio::AlignIO->new(-file => "inputfilename" ,

-format => 'clustalw');

$out = Bio::AlignIO->new(-file => ">outputfilename",

-format => 'fasta');

while ( my $aln = $in->next_aln() ) { $out->write_aln($aln); }

The only difference is that the returned object reference, $aln, is to a SimpleAlign object

BioPerl_Tutorial(1).pdf

Plik z chomika:

Inne pliki z tego folderu:

Inne foldery tego chomika: