SHELL LAB I

PART I

STEP I: Setup

  1. Download a file in FASTA format containing the sequences of some E. coli genes from here.
  2. Create a directory “laboratorio” in your home with the command mkdir:

                $ mkdir laboratorio

  3. Move the FASTA file from where you downloaded it to the directory you just created with the command mv:

                $ mv download-path laboratorio/

  4. Change the current directory to “laboratorio” (go into “laboratorio”).
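
The setup can be rehearsed safely in a scratch directory before touching your home. The sketch below is only an illustration: genes.fasta is a made-up name standing in for the file you downloaded, and mktemp -d creates a temporary directory to work in.

```shell
# Rehearsal of STEP I in a throwaway directory.
# "genes.fasta" is a hypothetical name standing in for the downloaded file.
scratch=$(mktemp -d)
cd "$scratch"
printf '>geneA\nACGT\n' > genes.fasta   # fake "download"

mkdir laboratorio                 # create the directory
mv genes.fasta laboratorio/       # move the file into it
cd laboratorio                    # enter the directory

pwd    # prints the current directory, which now ends in /laboratorio
ls     # lists genes.fasta
```

If pwd and ls show what you expect, repeat the same three commands (mkdir, mv, cd) in your home with the real file.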

STEP II: See the content of a file

  1. Use the command cat to see the content of the FASTA file:

                $ cat fasta-path

  2. To see just the beginning of the file use the command head, which prints the first ten lines by default:

                $ head fasta-path

  3. To see more (or fewer) than ten lines, we can pass an option to head:

                $ head -n 20 fasta-path

  4. To see just the end of the file use the command tail. It works like head and also has an option -n to see more or fewer than ten lines.
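
To check what head and tail select, it can help to try them on a file whose lines are easy to tell apart. The sketch below builds a hypothetical 12-line file (the names toy, first_three and last_two are made up for this example) and is not part of the lab data.

```shell
# Build a hypothetical 12-line file: line1, line2, ..., line12.
toy=$(mktemp)
printf 'line%s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > "$toy"

head "$toy"          # no option: the first ten lines (line1 to line10)
head -n 3 "$toy"     # the first three lines
tail -n 2 "$toy"     # the last two lines (line11 and line12)

# Keep two of the results in variables so they can be inspected later.
first_three=$(head -n 3 "$toy")
last_two=$(tail -n 2 "$toy")
```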

STEP III: Using less

  1. Try to see the content of the FASTA file with the command less. Unlike cat, less stays open until you close it.
  2. Inside less you can use the up and down arrows to scroll up and down one line.
  3. You can use page up and page down to scroll up and down one screen. Press “g” to go to the beginning of the file and “G” to go to the end.
  4. You can search for text by pressing “/”, writing the text (it appears on the bottom line) and then pressing enter to search for it in the file. Try to search for any sequence of five bases. Do you find it?
  5. Use “q” to close less and go back to the command prompt.

STEP IV: Extracting data

  1. Use the command wc to count lines, words and characters in a file. Try it on our FASTA.
  2. With the option -l, wc prints only the number of lines. Try it.
  3. Use the command grep to select only lines that match a pattern. For example use it to find lines containing the sequence “AAAAA”:

                $ grep "AAAAA" fasta-path

  4. Use grep to extract only the sequence names from the FASTA:

                $ grep ">" fasta-path

  5. Use grep to extract only the sequence bases from the FASTA, with the option -v, which selects the lines that do not match the pattern:

                $ grep -v ">" fasta-path
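
The quotes around ">" matter: without them the shell reads > as output redirection, so grep > fasta-path would empty the file instead of searching it. The sketch below tries the three grep commands on a hypothetical toy FASTA (two invented sequences, not the lab data) so that the output is small enough to check by eye.

```shell
# Hypothetical toy FASTA with two made-up sequences.
fasta=$(mktemp)
printf '>gene1\nACGTAAAAAT\nGGCC\n>gene2\nTTTTGGGG\n' > "$fasta"

# Lines containing five consecutive A's.
hits=$(grep "AAAAA" "$fasta")

# Only the sequence names (lines containing ">").
names=$(grep ">" "$fasta")

# Only the bases (-v inverts the match: lines NOT containing ">").
bases=$(grep -v ">" "$fasta")

echo "hits:";  echo "$hits"
echo "names:"; echo "$names"
echo "bases:"; echo "$bases"
```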

PART II

STEP I: Pipes

  1. Use grep and wc -l to count the sequences in the FASTA:

                $ grep ">" fasta-path | wc -l

  2. Now try to use grep and wc (without -l) to count lines, words and characters of the lines with sequence bases in the FASTA. The total number of bases is the number of characters minus the number of lines (each line ends with a newline character, which we need to remove from the total).
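
The arithmetic in the last point can be checked on a file small enough to count by hand. This is a minimal sketch, assuming a hypothetical toy FASTA with Unix line endings and 10 + 4 + 8 = 22 bases; the variable names are made up for the example.

```shell
# Hypothetical toy FASTA: 2 sequences, 22 bases spread over 3 lines.
fasta=$(mktemp)
printf '>gene1\nACGTAAAAAT\nGGCC\n>gene2\nTTTTGGGG\n' > "$fasta"

# Number of sequences: one ">" line per sequence.
n_seqs=$(grep ">" "$fasta" | wc -l)

# wc prints three numbers: lines, words, characters
# (it counts bytes, which is the same thing for plain ASCII text).
# "set --" stores the three numbers in $1, $2 and $3.
set -- $(grep -v ">" "$fasta" | wc)
n_lines=$1
n_chars=$3

# Every line ends with one newline character, so subtract one per line.
n_bases=$((n_chars - n_lines))

echo "sequences: $n_seqs"
echo "bases: $n_bases"
```

Here wc reports 3 lines and 25 characters for the base lines, and 25 - 3 gives the 22 bases counted by hand.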

STEP II: A SAM file

  1. Download a SAM file with some alignments of SOLiD reads of V. vinifera from here.
  2. Move it to the directory laboratorio.
  3. Use cat to see the content of the SAM file.
  4. Now try again with less.
  5. Since the lines are very long, add the option -S to less to avoid wrapping.

STEP III: More grep and wc

  1. Now extract only the header lines of the SAM file with grep.
  2. Extract only the alignment lines (that is, lines that are not in the header).
  3. Count header lines and alignments. To do this use the grep commands from the previous points and then wc -l, combining them in a pipe as before.
  4. Now try counting the alignment lines like this:

                $ grep -v "^@" sam-path | wc -l

        Did you get the same number as before?
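
One reason the two counts can differ is the ^ in the pattern: "^@" matches @ only at the beginning of a line, while "@" matches it anywhere, and the quality string of an alignment may contain @. The sketch below shows the effect on a hypothetical miniature SAM file (all read names, positions and qualities are invented for the example).

```shell
# Hypothetical miniature SAM file: 3 header lines and 4 alignments.
# Fields are separated by tabs (\t); the last field is the quality string.
sam=$(mktemp)
printf '@HD\tVN:1.0\tSO:unsorted\n'                      >  "$sam"
printf '@SQ\tSN:chr1\tLN:1000\n'                         >> "$sam"
printf '@SQ\tSN:chr2\tLN:2000\n'                         >> "$sam"
printf 'r1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tII@I\n'  >> "$sam"
printf 'r2\t16\tchr1\t20\t60\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$sam"
printf 'r3\t16\tchr2\t5\t60\t4M\t*\t0\t0\tACGT\t@III\n'  >> "$sam"
printf 'r4\t0\tchr1\t30\t60\t4M\t*\t0\t0\tACGT\tIIII\n'  >> "$sam"

# Header lines start with "@".
n_header=$(grep "^@" "$sam" | wc -l)

# Alignment lines are all the others.
n_align=$(grep -v "^@" "$sam" | wc -l)

# Without "^", alignments with "@" in the quality string are dropped too.
n_unanchored=$(grep -v "@" "$sam" | wc -l)

echo "header: $n_header  alignments: $n_align  unanchored: $n_unanchored"
```

In this toy file the anchored pattern finds all 4 alignments, while the unanchored one keeps only 2 of them, because r1 and r3 have an @ in their quality strings.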

STEP IV: Other extraction commands

  1. Now try to extract only reverse alignments. Check your results with:

                $ grep ... | less -S

  2. Now count the reverse alignments you extracted.
  3. Let's try a different method. The command cut can extract one or more columns from a tab-separated file like SAM. You choose the column(s) with the option -f, so cut -f 2 extracts the second column (the flag field). We also need to remove the header since it is not tabular. Try this:

                $ grep -v '^@' sam-path | cut -f 2 | less -S

        You should get only the second field (the flag) of the alignment lines.

  4. Modify the last command (without less) to count the reverse alignments. Do you get the same number as before?
  5. Now extract the reference (chromosome) field from the alignments with cut. Again you need grep to remove the header lines and then cut to extract the third column.
  6. Now try the command uniq with the option -c to count how many alignments each chromosome has. uniq -c counts how many times in a row each line is repeated. Extract the chromosomes as before and then use uniq -c to get a nice little table with the number of alignments for each chromosome.
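
Because uniq -c only merges lines that are next to each other, the table has one row per chromosome only if all the alignments of a chromosome are adjacent in the file. If they are not, the same chromosome appears in several rows, and piping through sort before uniq -c fixes it. The sketch below shows the difference on a hypothetical four-alignment SAM file (invented data, deliberately not sorted by chromosome).

```shell
# Hypothetical miniature SAM file: 1 header line and 4 alignments,
# with the chromosomes in the order chr1, chr1, chr2, chr1.
sam=$(mktemp)
printf '@HD\tVN:1.0\tSO:unsorted\n'                      >  "$sam"
printf 'r1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\n'  >> "$sam"
printf 'r2\t16\tchr1\t20\t60\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$sam"
printf 'r3\t16\tchr2\t5\t60\t4M\t*\t0\t0\tACGT\tIIII\n'  >> "$sam"
printf 'r4\t0\tchr1\t30\t60\t4M\t*\t0\t0\tACGT\tIIII\n'  >> "$sam"

# Without sort: chr1 shows up twice (2 chr1, 1 chr2, 1 chr1).
unsorted_counts=$(grep -v '^@' "$sam" | cut -f 3 | uniq -c)

# With sort: equal lines become adjacent, so each chromosome has one row.
sorted_counts=$(grep -v '^@' "$sam" | cut -f 3 | sort | uniq -c)

echo "without sort:"; echo "$unsorted_counts"
echo "with sort:";    echo "$sorted_counts"
```

Whether sort is needed for the lab file depends on how its alignments are ordered, so compare both versions on your data.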