SI221 (Fall AY2021)

SI221 Lab 1: Setup and Unix Intro/Refresh

Lab Setup

This semester you will be using the Ubuntu 18.04 Linux operating system. Log into a lab machine with your normal username and password.

Unix Intro/Refresh

The most basic of Unix basics

In this course, you will be using a Unix-based system. As you would suspect, Unix is an operating system - i.e. the interface between programs and the physical machine. There are multiple ways to interact with an operating system, the two main methods being the Graphical User Interface (GUI) and the "shell". The GUI is the program that provides an interactive user interface, like a desktop and those nice windows, menus, mouse click responses, etc. A shell is a command-line interface between you and Unix.

Ubuntu is built on Linux which, as you might have guessed, is modeled on Unix. You can do a lot by clicking on things just like you would with Microsoft Windows. You also have access to a powerful shell, which is where we will be doing a good amount of work. If you click on the Applications top menu item and then System Tools, you'll see the Terminal application. Click on it and up pops a shell terminal (command line) for you to enter commands in. You can also right click the desktop background and select Open Terminal. Or, possibly the fastest way to open a Terminal is to hold Ctrl+Alt and hit the 't' key.

Many people think of command-line interfaces to computers as primitive or outdated. Not true! Computer scientists of all ages live in a command-line. Once you've learned to use the shell, you'll find it's very flexible, powerful and fast. This portion of the lab will familiarize you with several of the commands you'll need to know for this course.

Home directory, current directory, navigating directories

In your shell you always have an idea of being in a directory. The command pwd (print working directory) tells you what directory you're currently "in". In Unix, /'s separate directories. So if pwd gives you

/usr/include

it means that you're "in" the include directory, which is contained in the usr directory, which is contained in the root directory, which is the starting point of the file system. The pwd command always gives the full path from the root directory to your current directory.

The directory you'll always start off "in" when you log on is your home directory. If you give a pwd, you should see that you're "in" the directory /home/m20xxxx, which is your home directory. The way your shell is set up, the current directory is listed in front of the prompt, but instead of listing /home/m20xxxx/ it simply lists ~/. The "~" (tilde) is used in the shell as a shorthand for the full path from the root directory to your home directory. In your home directory, you have the permissions to create, delete and move whatever files or directories you want to.

mkdir

The mkdir command makes a new directory inside the current directory. Create a directory in your home directory called si221 using the command:

mkdir si221

Inside the si221 directory create a new directory called labs. This is the directory where you should store all your lab files for this course, but hold off on creating a new folder for this lab.

The command ls lists the contents of the current directory. If you do an ls now, you should see si221 listed.

The command cd changes the current directory. Type cd si221 to change from your home directory to the si221 directory. To go back one level on the path, type cd .. (two dots). A . refers to the current directory, and a .. refers to the directory one up in the path. So, you have ~, . and .. to play with in your cd commands.

Exercise: cd to the root directory and do an ls to see what's there. Move around a bit and find the longest path you can in the directory tree.

Options to commands and `man`

Many Unix shell commands have options which affect the behavior of the command. For example, ls has a -1 option (the digit one, not a lowercase L) that causes the contents of the current directory to be listed one entry per line instead of in columns. (The similar-looking -l, with a lowercase L, gives a long listing with details about each entry.) Documentation for commands, including descriptions of the various options, is available via the command man.

Exercise: type man ls for documentation on the ls command. Space bar pages forward through the documentation, and b pages back. Press q to quit.

Using GitLab

Follow these instructions.

Working with Text Files

One of the philosophies of Unix is the ability to handle text, which is seen as a universal interface. Because of this philosophy, reading, writing and modifying text files is all-important. A program that lets you create and edit text files is called a text editor. There are many text editors for Linux/Unix. A few popular choices:

emacs
vi/vim/gvim
atom
gedit


It really doesn't matter which editor you use, but the remainder of this section of the lab will provide instructions in gedit. Later on, we explore emacs and vi.

Gedit is a GUI-based editor that runs in its own window, has all the nice menus and buttons you could want, and is adequate for programming. To start gedit type gedit & or, if you know how you'd like to name the file and where you'd like to save it, cd to whatever directory you'd like to create the file in and type gedit FILENAME & where FILENAME could be something like test.cpp or any other name that suits your fancy. The "&" means that the shell doesn't wait for gedit to exit before letting you run another command. You'll see gedit pop up in its own window.

Gedit has menus for most of the basic stuff you do, so a little experimentation should get you up and running. If you want to open an existing file, select the Open icon; to create a new one, select the icon to the left of the folder, which is a page icon with a green plus sign.

Exercise:

From your lab directory, create a new directory geditplay and move to that directory.

Launch gedit and open a new file named song1. Type in a line from your favorite song and save the file.

Open a new file named song2. Type in a line from a different song but don't save the file.

Use the tabs below the menu icons to switch back and forth between the song1 and song2 files. Note that one of your files has a * in front of the name, indicating it has not been saved. Leave it unsaved for now.

Drag one of the tabs from the gedit window outside of that window and onto the desktop. You have now created two separate gedit windows. This can be very handy when you are editing multiple files such as test.cpp and test.h for example.

Now drag the tab you originally moved back into its original window. Make sure song2 is displayed. Click the "X" located on the song2 tab. You will be prompted to save since that hasn't been done yet (thanks gedit!) After saving, go to your command prompt and do an ls (make sure you're in the directory where you saved it) to convince yourself that song2 is still there.
Note: the most recent version of gedit may not support combining separated windows back into tabs.

By this point you ought to be pretty comfortable with the absolute basics of gedit. See the link to gedit documentation on the resources page for more information!

Moving, copying, deleting

Download the following file to your Desktop (right click, Save Link As...): test.jpg.

You copy files in the shell using the cp command. So to copy the file to your home directory,

cp ~/Desktop/test.jpg ~/

This makes a copy of the file test.jpg and places it in your home directory, with the same name. If you give a destination directory for that second argument, it places the copy in that directory and keeps the name the same. If you give a new name for the second argument, the new name is used. So, for example,

cp ~/Desktop/test.jpg ~/mycopy-test.jpg

makes a copy in your home directory and calls it mycopy-test.jpg. Now, you may decide you want to give a file of yours a new name, rather than make a copy with a new name and keep the original around. The mv command moves files - i.e. renames them. The first argument is the name of the file to move, the second is its new name. For example, try:

mv test.jpg rickroll.jpg

Finally, you may decide that you're bored with the file rickroll.jpg and you'd like to delete it altogether. The rm command removes (i.e. deletes) a file.

rm rickroll.jpg

Exercise: Create a directory named temp in your lab directory and make a text file info in it that contains your name and alpha code. Copy the file to a file named info2 in the same directory. Rename info as info1. Rename the directory temp as infodir. What happened to the files info1 and info2? Try to copy the directory infodir to junk. What happened? Use man to figure out what you'd have to do with cp to copy a directory like infodir. Try to delete the directory infodir. What happened? Use man to figure out what you'd have to do with rm to delete a directory and all of its contents.

Some Unix Tools

Unix has some nice tools for common operations on text files.

wc - "word count", this counts the number of lines, words and characters in a file. "wc -l" just counts number of lines.
grep - at its simplest, grep just searches text for a given phrase and prints out all the lines that match that phrase. For example, pick a word (labeled YOUR_WORD below) from your song1 file and issue the command (adjusting the path if you created geditplay somewhere other than ~/si221/labs):
```
grep "YOUR_WORD" ~/si221/labs/geditplay/song1
```
You should see the entire line containing "YOUR_WORD" appear on the screen.
sort - sorts the lines of a file, alphabetically by default. The -n option sorts numerically instead, if each line begins with a number.
cut and paste - use the man page to learn more about these commands.

Using tools together

You should've noticed that all of these Unix tools spit their output onto the screen. In reality, they spit their output to standard out, which is your old friend cout in C++ programs. Where "standard out" goes is something you can control. By default, it's the screen, but you can redirect it to a file, if you like. If a unix command is followed by > filename, standard out is redirected to that file. For example:

grep "YOUR_WORD" ~/si221/labs/geditplay/song1 > ~/copy1.txt

creates the file copy1.txt and all of grep's output goes into that file.

Just as there is a standard out, there is a standard in as well. All of these utilities, like grep and wc can be called without any filenames, in which case they read from standard in. Standard in can also be redirected to read from a file. So, for example, wc < ~/copy1.txt is another way to call wc on the file copy1.txt.

When things get really powerful is when the standard out of one command is the standard in of another. A Unix pipe is what we use to tie the output of one program to the input of another. For example, suppose we wanted to know how many files are in the /usr/bin folder. Well,

ls /usr/bin

would spit out text with each file name. We'd like to count the number of individual files, and this is precisely what wc with the -w option does for us. We pipe the output of ls into wc like this:

ls /usr/bin | wc -w

Exercise: cd to the directory /usr/include and determine how many .h files there are. (It may be helpful to use ls -1, with the digit one, to list files one per line. Also, use the man page for wc to figure out which options will help you here.)

The Unix "Streams of Text" Philosophy

The Unix command-line philosophy means keeping data in easy-to-read text files and trying to make & use small, simple programs & utilities for processing that text. In this lab we're going to make our own utility in this style.

Computing statistics on large datasets is one very important application of computers. While it might not be serious, baseball is one place where statistics are all-important. There's a website called baseball-reference.com that has tons of stats that you can download. The 2013 American League batting stats have been copied and pasted into the file albat2013. Use this link to download this file into your lab01 directory:

albat2013

Open this file in gedit (or whatever your preferred editor is). To make the text appear orderly, you'll likely need to expand your window. When you have each player's information on one line, go ahead and move through the file with the cursor. You'll notice the row and column position of the cursor on the bottom bar. Notice that characters 88 through 92 on each line have the batting average. We can cut out just those columns of characters from the file with the cut command. The command:

cut -c 88-92 albat2013

does this for us. You'll notice that some lines are not valid batting averages, because some lines of the file are headers and some players didn't have any at bats. If we just want the batting averages we can use grep to filter them out. We only want the lines produced by cut that have a decimal point in them. Thus we're tempted to write grep ".". Unfortunately, the period means something special to grep, so we need to use "\" to escape it. Thus:

cut -c 88-92 albat2013 | grep "\."

cuts out the column of the file that contains the batting averages, the results are piped to grep, which throws away every line except those containing a decimal point. In other words, we get just the batting averages. See how this Unix philosophy is starting to pay off? Well, at least it pays off if you like baseball (or football, or basketball, or any other sports with lots of stats!)

Our own stats utilities

We could write a C++ program that reads in numbers from standard in (good ol' cin) until end-of-file is reached (if you type from the keyboard, ctrl-d gives you the end of file, or eof, character) and prints out the minimum, maximum and average of the values read in. In fact, that's already been done for you! Create a new file in your lab folder named mma.cpp with the following contents:

/************************************************
 * This program reads numbers from standard input
 * (i.e. through cin), computes min, max, and
 * average of the numbers, and prints these
 * values.
 ************************************************/
#include <iostream>
using namespace std;

int main() {
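  // Read 1st number & initialize values
  double next, min, max, sum;
  if (!(cin >> next)) {
    // No numbers on standard in, so there is nothing to report
    cerr << "no input" << endl;
    return 1;
  }
  min = max = sum = next;
  int count = 1;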

  // Read subsequent numbers and update min, max, sum and count
  while(cin >> next) {
    sum += next;
    count ++;
    if (next < min)
      min = next;
    if (next > max)
      max = next;
  }

  // Print results
  cout << "min = " << min << endl;
  cout << "max = " << max << endl;
  cout << "avg = " << sum/count << endl;

  return 0;
}

Compiling with g++

Now that you have your program in text form, you'll need to convert it to executable form. This is a process known as 'compiling'. Once again, in Unix we typically use separate tools for programming rather than use an IDE that takes care of everything for us. That means that we need to run the compiler for ourselves. The compiler we'll be using is g++, which is a freely available "open source" compiler. "Open source" means that you can download the source code for g++ (written mostly in C++ these days) and modify it if you wish. So if you find a bug, you can fix it yourself! ;-)

Like most Unix programs, you run g++ by typing g++ on the command line. You control what it does with "options", and you list the files on which g++ is supposed to operate. To compile, you use the -c option. Compiling a .cpp file (i.e. a source file) produces a .o file (i.e. an object file). So, to compile the file mma.cpp we'd give the command:

g++ -c mma.cpp

The file mma.o will be produced. This is called an 'Object' file. Compiling is a multi-stage process. The link stage links together your object file(s) and any libraries you might be using to create an executable file, which is an actual "program". The -o name option, where name is the name you want for your executable, links together any object files you list to form a program. For example, to take the object file mma.o we just compiled and create a program named mma, we'd type:

g++ -o mma mma.o

The output would be an executable file, which you can run from the shell:

./mma

Shortcuts for compilation

To make our lives a bit easier, g++ does the separate compilations and linking steps for us in a single line if we simply give it the -o name option and list all the source files involved. For the above example, this would be:

g++ -o mma mma.cpp

Obviously, this is more convenient. In fact, g++ simply automatically breaks things up into the separate compile and link steps we explicitly listed above. So the same things happen when we use this convenient short-hand.

Compile time, link time, run time

People typically refer to three stages in which things "happen" concerning a program: compile time, link time and run time. Understanding at which stage a thing happens can really help you understand it. This is especially true of errors.

A compile time error is something like a missing semi-colon or a type mis-match in a function call. These are things that can be determined from the information available to you in a given compilation unit.

A link time error is something that could not be determined from the information available to a single compilation unit, so it involves some sort of mismatch across two or more compilation units. This would be something like providing two different definitions matching the same prototype - or maybe providing no definition at all for a given prototype.

A run time error is a problem that only crops up once the program is running - for example, writing beyond the end of an array, or forgetting a base case in a recursive function definition. (It seems like the compiler should be able to figure out for itself whether a base case has been forgotten, but it can actually be mathematically proven that this is impossible to always do!)

Exercises

Compile the mma program and use it to find the min, max and average batting average for the American League in 2013. Hint: your max should be above .600 and your average should be around .200.

Use your mma program to find the min, max and average batting average for the Boston Red Sox (BOS), then for the New York Yankees (NYY), and then for the Baltimore Orioles (BAL). Hint: Use grep cleverly before doing the cut-piped-to-grep thing from above. The Red Sox max and average should be above .600 and .200, respectively.

Optional: If you're very ambitious, see if you can use cut and sort -n and some cut-and-paste in gedit to get rid of anyone who had fewer than 20 at bats before you compute the min, max and average for the American League. (AB means at bats.) Hint: your average should be just under .250.

Hopefully you see that by making your stats program fit the whole "streams-of-characters" model it allowed you to combine it with other Unix utilities to answer some questions pretty quickly.

A Multifile program

Compiling programs consisting of several files is not much different than single file compilation. Suppose, for example, we have a program consisting of the following files:

main.cpp

#include <iostream>
#include "fact.h"
using namespace std;

int main() {
  cout << fact(5) << endl;
  return 0;
}

fact.h

// returns the
// factorial of n
int fact(int n);

fact.cpp

#include "fact.h"

int fact(int n) {
  if (n == 0)
    return 1;
  return n*fact(n-1);
}

To create an executable factorial from this source, three things have to happen: fact.cpp needs to be compiled into object code, main.cpp needs to be compiled into object code, and the two object code files need to be linked together with standard libraries to form the executable factorial. Here are calls to g++ that get this done:

g++ -c fact.cpp                  # creates fact.o
g++ -c main.cpp                  # creates main.o
g++ -o factorial main.o fact.o   # creates factorial

It may seem like fact.h isn't involved, but that's not the case. Because it's included in the two .cpp files, it actually gets processed by the compiler in both compilations. Notice that compiling main.cpp and compiling fact.cpp are totally separate from one another. Think about what this means: to compile main.cpp the compiler doesn't need to know anything about fact.cpp - it doesn't need to know anything about the function fact, which it uses, except its prototype, which it gets from fact.h. Similarly, g++ doesn't need to know anything about how the function fact is going to be used in order to compile fact.cpp.

Incremental compilation

Suppose that, starting from the previous example program, we wanted to change things so that the user enters a number, and its factorial is printed out, rather than the factorial of 5 every time. We would modify main.cpp appropriately, and recompile. But what really needs to be recompiled? Clearly fact.cpp hasn't changed a bit, so there is no reason to recompile it. Thus, having made our change to main.cpp, we can produce our new executable with:

g++ -c main.cpp
g++ -o factorial main.o fact.o

This idea that as changes are made to a program we only recompile the things that have changed, rather than recompiling every source file every time, is called incremental compilation. Incremental compilation is important for two reasons. First of all, it allows you to write some source code, compile it, and distribute the object files (i.e. the .o files) rather than the source code (i.e. the .cpp files). People use this commercially to sell libraries, which are collections of object files, while protecting their intellectual property, which is the source code. Second of all, incremental compilation allows you to work on big programs without having to wait hours after every change for compilation. To put things in perspective, a program that Dr. Brown works with for his research - a relatively small program - consists of about 1,500 source files. It takes a fair amount of time to compile all of them. Imagine making a little change, waiting half an hour for the compiler to compile over a thousand files of source code, and then having it say "whoops, you missed a semi-colon"! You'd never get anything done.

Incremental compilation can be a bit tricky once header files get involved. For example, if we change fact.h in the above example, we really should recompile both main.cpp and fact.cpp, since both include it. In other words, if a header file gets modified, everything that includes it really ought to get recompiled ... just in case. And, of course, you actually have to keep track of which files have gotten modified since the last compilation. There is a Unix utility called make that does a lot of this stuff for you. We'll talk more about it later. For the moment, your programs are so small that a complete recompilation every time you make a change is reasonable. However, you still need to understand incremental compilation to understand programming languages, because the need to be able to compile different pieces of a program separately, in isolation, affects the way languages work. Each piece that gets compiled in isolation is called a compilation unit. For example, main.cpp (with the contents of fact.h stuck in it) is a compilation unit. Similarly, fact.cpp (with the contents of fact.h stuck in it) is a compilation unit.

Modifying the `mma` Program

Sometimes the "median" is a more informative statistic than the average (also called the "mean"). If you sort a list of numbers from smallest to largest, the median is the middle value in the list if there are an odd number of elements, or the average of the two middle-most elements if there are an even number. The median batting average would really give you a "typical" batting average.

Adding the ability to compute the median is problematic, because you need to store all the values (preferably in an array!) but you don't have any idea how many elements you'll get. The answer is to store them in a linked list first, then allocate an array of the proper size, then copy the elements into the array, then sort, and finally grab the median. (You could also read the file twice: once to find the size and once to input the data. You aren't allowed to do that for this lab.)

To help you realize this scheme, the following files have been provided for you: doublelist.h, doublelist.cpp, doublesort.h, and doublesort.o. Download the following file to your machine and extract into your lab01 folder:

lab01_files.zip

Notice that you aren't given the .cpp file for the implementation of doublesort. How it works is a trade secret, but you're free to link your program to the compiled object code doublesort.o to use it. You may use these files, but you may not modify them!

Exercise: Augment the minimum, maximum, average utility so that it also prints out the median using the files provided. Call the new utility mmam. Remember to place the source files and the compiled executable in your lab folder.

The simplest of `make`s

It can get to be a pain in the neck to retype the compilation command for a multi-file program, even with the up-arrow key to scroll back over previous commands. There is a Unix utility called make that makes it easier to make programs. At its simplest, make just lets you define "targets" and commands needed to achieve those targets. Simple example: You have two programs, mma and mmam, with several different files involved. You would create the following file called "Makefile":

Makefile

# The indentation in front of each g++ command below has got to be a tab, not spaces!
vers1:
	g++ -o mma mma.cpp

vers2:
	g++ -o mmam mmam.cpp doublelist.cpp doublesort.o

Now, you have two "targets", vers1 and vers2. If you type make vers1 at the command prompt, make will look up the target vers1, and execute the associated rule, in this case g++ -o mma mma.cpp. If you type make vers2 at the command prompt, make will look up the target vers2, and execute the associated rule, in this case g++ -o mmam mmam.cpp doublelist.cpp doublesort.o. Simple makefiles like these make it a lot easier to go through that compile, debug, compile, debug, compile, debug, etc. process.

Command line arguments

We'd really have something interesting if we could also define some command line arguments. For example, what if the user could run the program like this:

> mmam -max

and have only the maximum printed out? That'd be kind of cool, and that'd be in the spirit of these Unix utilities we've been using. C++ has a mechanism for passing such command line arguments to main. You can define main with zero arguments, or you can define it with two arguments: an int and an array of c-style strings, i.e. char* []. The two arguments passed to main will be the number of strings the user typed when calling the program (which includes the name of the program itself!), and an array of c-style strings holding each of the strings the user typed on the command line in calling the program. For example, the following test program prints out its command line arguments:

test.cpp

#include <iostream>
using namespace std;

int main(int argc, char* argv[]) {
 cout << "argc = " << argc << endl;
 for(int i = 0; i < argc; i++)  {
  cout << "argv[" << i << "] = "
  << argv[i] << endl;
 }
 return 0;
}

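Here's what it looks like when compiled and executed. For example, after compiling with g++ -o test test.cpp, running ./test -max hello prints:

argc = 3
argv[0] = ./test
argv[1] = -max
argv[2] = hello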

Now the big thing to remember is that these are c-style strings. You can cast the elements of argv to a C++ string object explicitly, or you can assign them to a C++ string object if you want C++ style strings like we're used to. This really matters for things like comparing for equality, since

if (argv[i] == "-max") { // <--- Very bad!
...
}

compares two pointers, rather than comparing strings character by character for equality. You'd be better off doing:

string s = argv[i];
if (s == "-max") {
...
}

Exercise: Add to mmam the functionality of recognizing the command line options -max, -min, -average and -median. Optional: If you're really ambitious, you should allow the user to specify as many options as they want and do all of them. If nothing is specified, all four should be printed out!

Submit Deliverables to the Submission System

Step 1: (As required) Install the submit script to your lab workstations and the VM on your personal laptops:

You only need to do the following once for your lab workstation and once for your laptop VM. (By installing this file on one lab workstation it will be available on all lab workstations for your use.)

Log on to http://submit.cs.usna.edu, returning to this lab page after logging in.
Make a directory called bin in your home directory. This can be done as: mkdir ~/bin
Right-click on "Download Personalized Submission Script", saving the file to the bin directory under your home directory
Open a terminal and give the command: chmod 700 ~/bin/submit

Step 2: Submit the program

In the same directory as the mmam.cpp file, give the following command:

~/bin/submit -c=SI221 -p=Lab01 mmam.cpp

The resulting output should include "The submission may be reviewed online at" followed by a URL. Copy that URL and paste it into your browser. The resulting page should tell you how you did. If you did not get the output 100% correct, the page should give you an indication of what's wrong with your output. Keep fixing your program and resubmitting until it works perfectly.

Finishing Up

You should now be comfortable with the basic operations of UNIX at the command line, be refreshed on your ability to generate source code, compile it, execute the program, and correct basic syntax errors.

Click the Power button in the top right corner and…
Choose the Log Out option. You should *not* shutdown or restart the machine.