LEAFF Programming Example

From kmer
Jump to: navigation, search

This is a simple example of using the LEAFF sequence reading classes in a C++ program. The program will extract a single sequence, specified by sequence name, from a fasta file and print a specified range of bases.

Suppose we have a file with 50 sequences, each 100 bases long:

% head test.fasta
>random000000
CCTGAACAATAAGCCTATATTTCCATCTTGGAACACAGCTGCCACTTTGCAGTTCCAGATAGCCCTGTCTCGCCGCACGTAATTTTATTGTTCCAGTTGC
>random000001
GTGGGTCACATTCTCACGCCGGCTTACGGTTGAAGACACTGTCTAATTTTCCGCCCATCCAGATGTAAGTCGGTCCATTTCGCTACGCCACGTAGAGTCG
>random000002
TCACGAAATGTCGGCACGGGAGCCTATTACCTCCCCTCGACACCAAACTGGGCTAATTGCACGAGCCTTTGAATCCCCATTTAAGGGAGTTATCCAGTGT
>random000003
TACAAACCCCTGGGCATACGACGGGAATTCGTCTGTCATCGAACGACCATTTGGGGTTGCTAAAGCCAGTCCATCCCCATGCAAGGTGGCTGAGGATGTG
>random000004
GCGCTGGATTTGGCATGGCACGCGGAGTGATAAAGTTATAGGCATGACTTGCCGATACTGTCATAATGGTAGGCCCTCATTACCTACAGGCTTTCCGAAC

Next, using our 'example' program, we'll extract the first 10 bases from each of these five sequences.

% ./example -b 0 -e 10 -s random000004 -f test.fasta
GCGCTGGATT

% ./example -b 0 -e 10 -s random000003 -f test.fasta
TACAAACCCC

% ./example -b 0 -e 10 -s random000002 -f test.fasta
TCACGAAATG

% ./example -b 0 -e 10 -s random000001 -f test.fasta
GTGGGTCACA

% ./example -b 0 -e 10 -s random000000 -f test.fasta
CCTGAACAAT

And, finally, it does correctly fail if no sequence is found.

% ./example -b 0 -e 10 -s not-here -f test.fasta
ERROR: sequence 'not-here' not found.

The Program

The program is compiled using the following command. You will likely need to adjust the paths to the 'include' and 'lib' directories. These directories are created when kmer is compiled with 'make install'.

g++ -o example example.C -I/work/kmer/FreeBSD-amd64/include -L/work/kmer/FreeBSD-amd64/lib -lseq -lbio -lutil


#include <stdio.h>
#include <stdlib.h>

#include "bio++.H"
#include "seqCache.H"

int
main(int argc, char **argv) {
  unsigned int          bgn = 0;
  unsigned int          end = 0;
  char                 *sid = 0L;
  char                 *N   = 0L;

  int arg=1;
  while (arg < argc) {
    if        (strcmp(argv[arg], "-b") == 0) {
      bgn = atoi(argv[++arg]);

    } else if (strcmp(argv[arg], "-e") == 0) {
      end = atoi(argv[++arg]);

    } else if (strcmp(argv[arg], "-s") == 0) {
      sid = argv[++arg];

    } else if (strcmp(argv[arg], "-f") == 0) {
      N = argv[++arg];

    } else {
      fprintf(stderr, "unknown option '%s'\n", argv[arg]);
    }

    arg++;
  }

  if ((N == 0L) || (sid == 0L)) {
    fprintf(stderr, "usage: %s [-b bgn] [-e end] -s sid -f file.fasta\n", argv[0]);
    fprintf(stderr, "       -b bgn          bgn coordinate of sequence to report\n");
    fprintf(stderr, "       -e end          end coordinate of sequence to report\n");
    fprintf(stderr, "                       coordinates are space-based (0-based)\n");
    fprintf(stderr, "\n");
    fprintf(stderr, "       -s name         name of sequence to output\n");
    fprintf(stderr, "\n");
    fprintf(stderr, "       -f file.fasta   sequences\n");
    exit(1);
  }

  seqCache   *F = new seqCache(N);
  seqInCore  *S = F->getSequenceInCore(sid);

  if (S == NULL)
    fprintf(stderr, "ERROR: sequence '%s' not found.\n", sid), exit(1);

  char       *seq = (char *)S->sequence();

  if (bgn < 0)
    bgn = 0;

  if (S->sequenceLength() < end)
    end = S->sequenceLength();

  char       *rep = new char [end - bgn + 1];

  memcpy(rep, seq + bgn, sizeof(char) * (end - bgn));
  rep[end-bgn] = 0;

  fprintf(stdout, "%s\n", rep);

  delete [] rep;
  delete    S;
  delete    F;

  exit(0);
}