BioPerl_Tutorial.pdf

(196 KB) Pobierz
B
IO
P
ERL  
T
UTORIAL  
(
ABREV
.)  
1)  Simple  script  to  get  a  sequence  by  Id  and  write  to  specified  format  
use Bio::Perl;
# this script will only work if you have an internet connection on the
# computer you're using, the databases you can get sequences from
# are 'swiss', 'genbank', 'genpept', 'embl', and 'refseq'
$seq_object = get_sequence('genbank',"ROA1_HUMAN");
write_sequence(">roa1.fasta",'fasta',$seq_object);
Getting  started  in  Bio::Perl  
 
That  second  argument  of  
write_sequence
,  
'fasta',  
is  the  sequence  format.  You  
can  choose  among  all  the  formats  supported  by  SeqIO.      Try  writing  the  sequence  file  
in  
'
genbank
'
format.  
2) Another  example  is  the  ability  to  blast  a  sequence  using  the  facilities  as  NCBI.  Please  
be  careful  not  to  abuse  the  resources  that  NCBI  provides  and  use  this  only  for  
individual  searches.  If  you  want  to  do  a  large  number  of  BLAST  searches,  please  
download  the  blast  package  and  install  it  locally.  
use Bio::Perl;
$seq = get_sequence('genbank',"ROA1_HUMAN");
# uses the default database - nr in this case
$blast_result = blast_sequence($seq);
write_blast(">roa1.blast",$blast_result);
Bio::Perl  has  a  limited  number  of  functions  to  retrieve  and  manipulate  sequence  
data.    (Bio::Perl  manpage)  
 
Sequence  objects
   
(Seq,  PrimarySeq,  RichSeq,  LargeSeq,  LocatableSeq,  RelSegment,  LiveSeq,  
SeqWithQuality,  SeqI)  
This  section  describes  various  Bioperl  sequence  objects.  Typically  you  don’t  need  to  
know  the  type  of  Sequence  object  because  SeqIO  assesses  and  creates  the  right  type  
of  object  when  given  a  file,  filehandle  or  string.    
Seq    
The  central  sequence  object  in  bioperl.  When  in  doubt  this  is  probably  the  object  
that  you  want  to  use  to  describe  a  DNA,  RNA  or  protein  sequence  in  bioperl.  Most  
common  sequence  manipulations  can  be  performed  with  Seq.    (Bio::Seq  manpage).  
Seq  objects  may  be  created  for  you  automatically  when  you  read  in  a  file  containing  
sequence  data  using  the  SeqIO  object  (see  below).    In  addition  to  storing  its  
identification  labels  and  the  sequence  itself,  a  Seq  object  can  store  multiple  
annotations  and  associated  ``sequence  features'',  such  as  those  contained  in  most  
Genbank  and  EMBL  sequence  files.  This  capability  can  be  very  useful  -­‐  especially  in  
development  of  automated  genome  annotation  systems.  
PrimarySeq    
A  stripped-­‐down  version  of  Seq.  It  contains  just  the  sequence  data  itself  and  a  few  
identifying  labels  (id,  accession  number,  alphabet  =  dna,  rna,  or  protein),  and  no  
features.  For  applications  with  hundreds  or  thousands  or  sequences,  using  
PrimarySeq  objects  can  significantly  speed  up  program  execution  and  decrease  the  
amount  of  RAM  the  program  requires.  (Bio::PrimarySeq  manpage).  
RichSeq
   
Stores  additional  annotations  beyond  those  used  by  standard  Seq  objects.  If  you  are  
using  sources  with  very  rich  sequence  annotation,  you  may  want  to  consider  using  
these  objects.  RichSeq  objects  are  created  automatically  when  Genbank,  EMBL,  or  
Swissprot  format  files  are  read  by  SeqIO.  (Bio::Seq::RichSeqI  manpage)  
LargeSeq    
Used  for  handling  very  long  sequences  (e.g.  >  100  MB).  (Bio::Seq::LargeSeq  
manpage).  
LocatableSeq    
Might  be  more  appropriately  called  an  ``AlignedSeq''  object.  It  is  a  Seq  object  which  
is  part  of  a  multiple  sequence  alignment.  It  has  start  and  end  positions  indicating  
from  where  in  a  larger  sequence  it  may  have  been  extracted.  It  also  may  have  gap  
symbols  corresponding  to  the  alignment  to  which  it  belongs.  It  is  used  by  the  
alignment  object  SimpleAlign  and  other  modules  that  use  SimpleAlign  objects  (e.g.  
AlignIO.pm,  pSW.pm).  
LocatableSeq  objects  will  be  made  for  you  automatically  when  you  create  an  
alignment  (using  pSW,  Clustalw,  Tcoffee,  Lagan,  or  bl2seq)  or  when  you  input  an  
alignment  data  file  using  AlignIO.  However  if  you  need  to  input  a  sequence  
alignment  by  hand  (e.g.  to  build  a  SimpleAlign  object),  you  will  need  to  input  the  
sequences  as  LocatableSeqs.  Other  sources  of  information  include  the  
Bio::LocatableSeq  manpage,  the  Bio::SimpleAlign  manpage,  the  Bio::AlignIO  
manpage,  and  the  Bio::Tools::pSW  manpage.  
SeqWithQuality    
Used  to  manipulate  sequences  with  quality  data,  like  those  produced  by  phred.,  ,  and  
in  (Bio::Seq::SeqWithQuality  manpage).  
RelSegment    
Useful  when  you  want  to  be  able  to  manipulate  the  origin  of  the  genomic  coordinate  
system.  This  situation  may  occur  when  looking  at  a  sub-­‐sequence  (e.g.  an  exon)  
which  is  located  on  a  longer  underlying  underlying  sequence  such  as  a  chromosome  
or  a  contig.  Such  manipulations  may  be  important,  for  example  when  designing  a  
graphical  genome  browser.  If  your  code  may  need  such  a  capability,  look  at  the  
documentation  the  Bio::DB::GFF::RelSegment  manpage  which  describes  this  feature  
in  detail.  
LiveSeq  
Addresses  the  problem  of  features  whose  location  on  a  sequence  changes  over  time.  
This  can  happen,  for  example,  when  sequence  feature  objects  are  used  to  store  gene  
locations  on  newly  sequenced  genomes  -­‐  locations  which  can  change  as  higher  
quality  sequencing  data  becomes  available.  Although  a  LiveSeq  object  is  not  
implemented  in  the  same  way  as  a  Seq  object,  LiveSeq  does  implement  the  SeqI  
interface  (see  below).  Consequently,  most  methods  available  for  Seq  objects  will  
work  fine  with  LiveSeq  objects.  Section  III.7.4  and  the  Bio::LiveSeq  manpage  contain  
further  discussion  of  LiveSeq  objects.  
SeqI    
Seq  ``interface  objects''.  They  are  used  to  ensure  bioperl's  compatibility  with  other  
software  packages.  SeqI  and  other  interface  objects  are  not  likely  to  be  relevant  to  
the  casual  Bioperl  user.  (Bio::SeqI  manpage)  
Accessing  sequence  data  from  local  and  remote  databases  
$seq = Bio::Seq->new(-seq
-description
-display_id
-accession_number
-alphabet
=>
=>
=>
=>
=>
Example:  create  a  simple  Seq  object.    Can  you  make  it  print  the  accession  number,  
alphabet,  and  sequence  each  on  a  separate  line  
'actgtggcgtcaact',
'Sample Bio::Seq object',
'something',
'BIOL_5310',
'dna' );
In  most  cases,  you  will  probably  be  accessing  sequence  data  from  some  online  data  
file  or  database.  
Accessing  remote  databases  
Example:  The  following  code  example  will  get  3  different  sequences  from  GenBank.      
$gb = new Bio::DB::GenBank();
# this returns a Seq object :
$seq1 = $gb->get_Seq_by_id('MUSIGHBA1');
# this also returns a Seq object :
$seq2 = $gb->get_Seq_by_acc('AF303112');
# this returns a SeqIO object, which can be used to get a Seq object :
$seqio = $gb->get_Stream_by_id(["J00522","AF303112","2981014"]);
$seq3 = $seqio->next_seq;
Can  you  make  it  print  all  the  sequence  descriptions?  
Transforming  sequence  files  (SeqIO)  
A  common  -­‐  and  tedious  -­‐  bioinformatics  task  is  that  of  converting  sequence  data  
among  the  many  widely  used  data  formats.  Bioperl's  SeqIO  object,  however,  makes  
this  chore  a  breeze.  SeqIO  can  read  a  stream  of  sequences  -­‐  located  in  a  single  or  in  
multiple  files  -­‐  in  a  number  of  formats  including  Fasta,  EMBL,  GenBank,  Swissprot,  
SCF,  phd/phred,  Ace,  fastq,  exp,  chado,  or  raw  (plain  sequence).  SeqIO  can  also  parse  
tracefiles  in  alf,  ztr,  abi,  ctf,  and  ctr  format  Once  the  sequence  data  has  been  read  in  
with  SeqIO,  it  is  available  to  Bioperl  in  the  form  of  Seq,  PrimarySeq,  or  RichSeq  
objects,  depending  on  what  the  sequence  source  is.  Moreover,  the  sequence  objects  
can  then  be  written  to  another  file  using  SeqIO  in  any  of  the  supported  data  formats  
making  data  converters  simple  to  implement,  for  example:  
use Bio::SeqIO;
$in = Bio::SeqIO->new(-file => "inputfilename",
-format => 'Fasta');
$out = Bio::SeqIO->new(-file => ">outputfilename",
-format => 'EMBL');
while ( my $seq = $in->next_seq() ) {$out->write_seq($seq); }
 
In  addition,  the  Perl  ``tied  filehandle''  syntax  is  available  to  SeqIO,  allowing  you  to  
use  the  standard  <>  and  print  operations  to  read  and  write  sequence  objects,  eg:  
   
$in
= Bio::SeqIO->newFh(-file => "inputfilename" ,
-format => 'fasta');
$out = Bio::SeqIO->newFh(-format => 'embl');
print $out $_ while <$in>;
If  the  ``-­‐format''  argument  isn't  used  then  Bioperl  will  try  to  determine  the  format  based  
on  the  file's  suffix,  in  a  case-­‐insensitive  manner.  If  there's  no  suffix  available  then  SeqIO  
will  attempt  to  guess  the  format  based  on  actual  content.  If  it  can't  determine  the  format  
then  it  will  assume  ``fasta''.  A  complete  list  of  formats  and  suffixes  can  be  found  in  the  
SeqIO  HOWTO  (http://www.bioperl.org/wiki/HOWTO:SeqIO).  
Practice:  Get  the  sequences  AJ312413,  NP_001073624,  XM_001807534  with  accession  
number  from  genbank  and  write  them  to  a  file  in  fasta  format.  
 
Transforming  alignment  files  (AlignIO)  
Data  files  storing  multiple  sequence  alignments  also  appear  in  varied  formats.  AlignIO  is  
the  Bioperl  object  for  conversion  of  alignment  files.  AlignIO  is  patterned  on  the  SeqIO  
object  and  its  commands  have  many  of  the  same  names  as  the  commands  in  SeqIO.  Just  as  
in  SeqIO  the  AlignIO  object  can  be  created  with  ``-­‐file''  and  ``-­‐format''  options:  
use Bio::AlignIO;
my $io = Bio::AlignIO->new(-file
=> "example.aln",
-format => "clustalw" );
If  the  ``-­‐format''  argument  isn't  used  then  Bioperl  will  try  and  determine  the  format  based  
on  the  file's  suffix,  in  a  case-­‐insensitive  manner.  Here  is  the  current  set  of  suffixes:  
Format
bl2seq
clustalw
emboss*
fasta
maf
mase
mega
meme
metafasta
msf
nexus
pfam
phylip
po
prodom
psi
selex
stockholm
Suffixes
aln
water|needle
fasta|fast|seq|fa|fsa|nt|aa
maf
Seaview
meg|mega
meme
msf|pileup|gcg
nexus|nex
pfam|pfm
phylip|phlp|phyl|phy|phy|ph
psi
selex|slx|selx|slex|sx
GCG
interleaved
POA
PSI-BLAST
HMMER
Comment
 
*water,  needle,  matcher,  stretcher,  merger,  and  supermatcher.    
Unlike  SeqIO,  AlignIO  cannot  create  output  files  in  every  format.  AlignIO  currently  
supports  output  in  these  7  formats:  
fasta, mase, selex, clustalw, msf/gcg, phylip
(interleaved), and po
.  
Another  significant  difference  between  AlignIO  and  SeqIO  is  that  AlignIO  handles  IO  for  
only  a  single  alignment  at  a  time  but  SeqIO.pm  handles  IO  for  multiple  sequences  in  a  
single  stream.  Syntax  for  AlignIO  is  almost  identical  to  that  of  SeqIO:  
use Bio::AlignIO;
$in = Bio::AlignIO->new(-file => "inputfilename" ,
-format => 'clustalw');
$out = Bio::AlignIO->new(-file => ">outputfilename",
-format => 'fasta');
while ( my $aln = $in->next_aln() ) { $out->write_aln($aln); }
The  only  difference  is  that  the  returned  object  reference,  $aln,  is  to  a  SimpleAlign  object  
rather  than  to  a  Seq  object.  
Zgłoś jeśli naruszono regulamin