This semester you will be using the Ubuntu 18.04 Linux operating system. Log into a lab machine with your normal username and password.
The most basic of Unix basics
In this course, you will be using a Unix-based system. As you would suspect, Unix is an operating system - i.e. the interface between programs and the physical machine. There are multiple ways to interact with an operating system, the two main methods being the Graphical User Interface (GUI) and the "shell". The GUI is the program that provides an interactive user interface, like a desktop and those nice windows, menus, mouse click responses, etc. A shell is a command-line interface between you and Unix.
Ubuntu is built on Linux which, as you might have guessed, is modeled on Unix. You can do a lot by clicking on things just like you would with Microsoft Windows. You also have access to a powerful shell, which is where we will be doing a good amount of work. If you click on the Applications top menu item and then System Tools, you'll see the Terminal application. Click on it and up pops a shell terminal (command line) for you to enter commands in. You can also right click the desktop background and select Open Terminal. Or, possibly the fastest way to open a Terminal is to hold Ctrl+Alt and hit the 't' key.
Many people think of command-line interfaces to computers as primitive or outdated. Not true! Computer scientists of all ages live in a command-line. Once you've learned to use the shell, you'll find it's very flexible, powerful and fast. This portion of the lab will familiarize you with several of the commands you'll need to know for this course.
In your shell you always have an idea of being in a directory. The command pwd (print working directory) tells you what directory you're currently "in". In Unix, /'s separate directories. So if pwd gives you
/usr/include
it means that you're "in" the include directory, which is contained in the usr directory, which is contained in the root directory, which is the starting point of the file system. The pwd command always gives the full path from the root directory to your current directory.
The directory you'll always start off "in" when you log on is your home directory. If you give a pwd, you should see that you're "in" the directory /home/m20xxxx, which is your home directory. The way your shell is set up, the current directory is listed in front of the prompt, but instead of listing /home/m20xxxx/ it simply lists ~/. The "~" (tilde) is used in the shell as a shorthand for the full path from the root directory to your home directory. In your home directory, you have the permissions to create, delete and move whatever files or directories you want to.
mkdir
The mkdir command makes a new directory inside the current directory. Create a directory in your home directory called si221 using the command:
mkdir si221
Inside the si221 directory create a new directory called labs. This is the directory where you should store all your lab files for this course, but hold off on creating a new folder for this lab.
ls
The command ls lists the contents of the current directory. If you do an ls now, you should see si221 listed.
cd
The command cd changes the current directory. Type cd si221 to change from your home directory to the si221 directory. To go back one level on the path, type cd .. at the prompt. A . refers to the current directory, a .. refers to the directory one up in the path. So, you have ~, . and .. to play with in your cd commands.
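The navigation commands above can be tried out safely in a throwaway directory. The following is only a sketch: the scratch directory created with mktemp and the variable names are assumptions made for illustration, so that nothing in your real home directory is touched.

```shell
# Work in a scratch directory (an assumption for this sketch).
scratch=$(mktemp -d)
cd "$scratch"

mkdir si221                   # new directory inside the current one
mkdir si221/labs              # and one inside that
cd si221/labs
here=$(pwd)                   # full path, ending in /si221/labs

cd ..                         # ".." is one level up: now in si221
up_one=$(basename "$(pwd)")

cd ./labs                     # "." is the current directory, so same as: cd labs
back=$(basename "$(pwd)")

home=$(cd ~ && pwd)           # "~" expands to your home directory

echo "$here"
echo "$up_one $back"
echo "$home"
```

The first echo shows a path ending in /si221/labs, and the second shows the directory names visited after cd .. and cd ./labs.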
Exercise: cd to the root directory and do an ls to see what's there. Move around a bit and find the longest path you can in the directory tree.
man
Many Unix shell commands have options which affect the behavior of the command. For example, ls has a -1 (the digit one, not a lowercase L) option that causes the contents of the current directory to be listed one entry per line instead of in columns. Documentation for commands, including descriptions of the various options, is available via the command man.
Exercise: type man ls for documentation on the ls command. Space bar pages forward through the documentation, and b pages back. Press q to quit.
Follow these instructions.
One of the philosophies of Unix is the ability to handle text, which is seen as a universal interface. Because of this philosophy, reading, writing and modifying text files is all-important. A program that lets you create and edit text files is called a text editor. There are many text editors for Linux/Unix. A few popular choices are gedit, emacs and vi.
It really doesn't matter which editor you use, but the remainder of this section of the lab will provide instructions in gedit. Later on, we explore emacs and vi.
Gedit is a GUI-based editor that runs in its own window, has all the nice menus and buttons you could want, and is adequate for programming. To start gedit, type gedit & or, if you know how you'd like to name the file and where you'd like to save it, cd to whatever directory you'd like to create the file in and type gedit FILENAME & where FILENAME could be something like test.cpp or any other name that suits your fancy. The "&" means that the shell doesn't wait for gedit to exit before letting you run another command. You'll see gedit pop up in its own window.
Gedit has menus for most of the basic stuff you do, so a little experimentation should get you up and running. If you want to open an existing file, select the Open icon; to create a new one, select the icon to the left of the folder, which is a page icon with a green plus sign.
Exercise:
- From your lab directory, create a new directory geditplay and move to that directory.
- Launch gedit and open a new file named song1. Type in a line from your favorite song and save the file.
- Open a new file named song2. Type in a line from a different song but don't save the file.
- Use the tabs below the menu icons to switch back and forth between the song1 and song2 files. Note that one of your files has a * in front of the name, indicating it has not been saved. Leave it unsaved for now.
- Drag one of the tabs from the gedit window outside of that window and onto the desktop. You have now created two separate gedit windows. This can be very handy when you are editing multiple files such as test.cpp and test.h, for example.
- Now drag the tab you originally moved back into its original window. Make sure song2 is displayed. Click the "X" located on the song2 tab. You will be prompted to save since that hasn't been done yet (thanks gedit!). After saving, go to your command prompt and do an ls (make sure you're in the directory where you saved it) to convince yourself that song2 is still there.
Note: the most recent version of gedit may not support combining separated windows back into tabs.
By this point you ought to be pretty comfortable with the absolute basics of gedit. See the link to gedit documentation on the resources page for more information!
Download the following file to your Desktop (right click, Save Link As...): test.jpg.
You copy files in the shell using the cp command. So to copy the file to your home directory,
cp ~/Desktop/test.jpg ~/
This makes a copy of the file test.jpg and places it in your home directory, with the same name. If you give a destination directory for that second argument, it places the copy in that directory and keeps the name the same. If you give a new name for the second argument, the new name is used. So, for example,
cp ~/Desktop/test.jpg ~/mycopy-test.jpg
makes a copy in your home directory and calls it mycopy-test.jpg. Now, you may decide you want to give a file of yours a new name, rather than make a copy with a new name and keep the original around. The mv command moves files - i.e. renames them. The first argument is the name of the file to move, the second is its new name. For example, try:
mv test.jpg rickroll.jpg
Finally, you may decide that you're bored with the file rickroll.jpg and you'd like to delete it altogether. The rm command removes (i.e. deletes) a file.
rm rickroll.jpg
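To see the difference between copying, renaming and deleting in one place, here is a small sketch run in a scratch directory. The scratch directory and the stand-in file contents are assumptions for illustration; the file names are the ones used above.

```shell
# Scratch directory so no real files are affected (an assumption for this sketch).
scratch=$(mktemp -d)
cd "$scratch"

echo "stand-in for a picture" > test.jpg   # hypothetical file contents

cp test.jpg mycopy-test.jpg   # copy: test.jpg and mycopy-test.jpg both exist
mv test.jpg rickroll.jpg      # rename: test.jpg is gone, rickroll.jpg appears
rm rickroll.jpg               # delete: only the copy is left

remaining=$(ls)
echo "$remaining"
```

After these three commands only mycopy-test.jpg remains, since cp left the original in place, mv renamed it, and rm then deleted it.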
Exercise: Create a directory named temp in your lab directory and make a text file info in it that contains your name and alpha code. Copy the file to a file named info2 in the same directory. Rename info as info1. Rename the directory temp as infodir. What happened to the files info1 and info2? Try to copy the directory infodir to junk. What happened? Use man to figure out what you'd have to do with cp to copy a directory like infodir. Try to delete the directory infodir. What happened? Use man to figure out what you'd have to do with rm to delete a directory and all of its contents.
Unix has some nice tools for common operations on text files.
- wc - "word count", this counts the number of lines, words and characters in a file. "wc -l" just counts the number of lines.
- grep - at its simplest, grep just searches text for a given phrase and prints out all the lines that match that phrase. For example, pick a word (labeled YOUR_WORD below) from your song1 file and issue the command:
grep "YOUR_WORD" ~/geditplay/song1
You should see the entire line containing "YOUR_WORD" appear on the screen.
- sort - sorts the lines of a file, alphabetically by default. The -n option sorts numerically instead, if each line begins with a number.
- cut and paste - use the man page to learn more about these commands.
You should've noticed that all of these Unix tools spit their output onto the screen. In reality, they spit their output to standard out, which is your old friend cout
in C++ programs. Where "standard out" goes is something you can control. By default, it's the screen, but you can redirect it to a file, if you like. If a unix command is followed by > filename
, standard out is redirected to that file. For example:
grep "YOUR_WORD" ~/geditplay/song1 > ~/copy1.txt
creates the file copy1.txt
and all of grep
's output goes into that file.
Just as there is a standard out, there is a standard in as well. All of these utilities, like grep
and wc
can be called without any filenames, in which case they read from standard in. Standard in can also be redirected to read from a file. So, for example, wc < ~/copy1.txt
is another way to call wc
on the file copy1.txt
.
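Redirection in both directions can be sketched as follows. The scratch directory and the two lines of sample text are made up for illustration; only grep, wc, > and < come from the discussion above.

```shell
# Scratch directory and sample text are assumptions for this sketch.
scratch=$(mktemp -d)
cd "$scratch"
printf 'first line with apples\nsecond line with pears\n' > song1

# Standard out redirected: grep's output goes into copy1.txt, not the screen.
grep "apples" song1 > copy1.txt

# Standard in redirected: wc reads copy1.txt as if it had been typed in.
lines=$(wc -l < copy1.txt)

matched=$(cat copy1.txt)
echo "$matched"
echo $lines
```

One small difference you may notice: when wc reads from standard in it has no file name to report, so it prints only the counts.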
Things get really powerful when the standard out of one command becomes the standard in of another. A Unix pipe is what we use to tie the output of one program to the input of another. For example, suppose we wanted to know how many files are in the /usr/bin folder. Well,
ls /usr/bin
would spit out text with each file name. We'd like to count the number of individual files, and this is precisely what wc
with the -w
option does for us. We pipe the output of ls
into wc
like this:
ls /usr/bin | wc -w
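Counting words works as long as no file name contains a space. The sketch below, using a scratch directory and three made-up file names, shows how the word count and the line count of the same ls output can differ.

```shell
# Scratch directory and file names are assumptions for this sketch.
scratch=$(mktemp -d)
cd "$scratch"
touch alpha beta "two words"   # three files, one with a space in its name

by_words=$(ls | wc -w)   # counts words: alpha, beta, two, words
by_lines=$(ls | wc -l)   # counts lines: ls prints one name per line into a pipe

echo $by_words
echo $by_lines
```

Here the word count comes out as 4 while the line count is 3, so counting lines is the more robust choice whenever names might contain spaces.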
Exercise: cd to the directory /usr/include and determine how many .h files there are. (It may be helpful to use ls -1 to list files one per line. Also, use the man page for wc to figure out which options will help you here.)
The Unix command-line philosophy means keeping data in easy-to-read text files and trying to make & use small, simple programs & utilities for processing that text. In this lab we're going to make our own utility in this style.
Computing statistics on large datasets is one very important application of computers. While it might not be serious, baseball is one place where statistics are all-important. There's a website called baseball-reference.com that has tons of stats that you can download. The 2013 American League batting stats have been copied and pasted into the file albat2013. Use this link to download this file into your lab01
directory:
Open this file in gedit (or whatever your preferred editor is). To make the text appear orderly, you'll likely need to expand your window. When you have each player's information on one line, go ahead and move through the file with the cursor. You'll notice the row and column position of the cursor on the bottom bar. Notice that characters 88 through 92 on each line have the batting average. We can cut out just those columns of characters from the file with the cut
command. The command:
cut -c 88-92 albat2013
does this for us. You'll notice that some lines are not valid batting averages, because some lines of the file are headers and some players didn't have any at bats. If we just want the batting averages we can use grep to filter them out. We only want the lines produced by cut
that have a decimal point in them. Thus we're tempted to write grep "."
. Unfortunately, the period means something special to grep, so we need to use "\" to escape it. Thus:
cut -c 88-92 albat2013 | grep "\."
cuts out the column of the file that contains the batting averages; the results are piped to grep, which throws away every line except those containing a decimal point. In other words, we get just the batting averages. See how this Unix philosophy is starting to pay off? Well, at least it pays off if you like baseball (or football, or basketball, or any other sports with lots of stats!)
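The effect of escaping the period can be seen on a tiny made-up table. Everything about the sample below is hypothetical (the names, the numbers, and the column positions 12-16, which are not the positions in albat2013); it is only meant to show the cut-then-grep pattern.

```shell
# Scratch directory and sample data are assumptions for this sketch.
scratch=$(mktemp -d)
cd "$scratch"

# Fixed-width sample: name padded to 11 characters, value starting at column 12.
printf '%-11s%s\n' "Name" "BA" "PlayerA" ".312" "PlayerB" "" "PlayerC" ".250" > sample

# Unescaped "." matches any character, so the header text "BA" slips through.
n_unescaped=$(cut -c 12-16 sample | grep -c ".")

# Escaped "\." matches only a literal period, leaving just the averages.
n_escaped=$(cut -c 12-16 sample | grep -c "\.")
averages=$(cut -c 12-16 sample | grep "\.")

echo "$n_unescaped $n_escaped"
echo "$averages"
```

With this sample, the unescaped pattern keeps 3 lines (the header plus two averages) while the escaped pattern keeps only the 2 averages.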
We could write a C++ program that reads in numbers from standard in (good ol' cin
) until end-of-file is reached (if you type from the keyboard, ctrl-d gives you the end of file, or eof, character) and prints out the minimum, maximum and average of the values read in. In fact, that's already been done for you! Create a new file in your lab folder named mma.cpp
with the following contents:
/************************************************
* This program reads numbers from standard input
* (i.e. through cin), computes min, max, and
* average of the numbers, and prints these
* values.
************************************************/
#include <iostream>
using namespace std;
int main() {
    // Read 1st number & initialize values
    double next, min, max, sum;
    cin >> next;
    min = max = sum = next;
    int count = 1;
    // Read subsequent numbers and update min, max, sum and count
    while (cin >> next) {
        sum += next;
        count++;
        if (next < min)
            min = next;
        if (next > max)
            max = next;
    }
    // Print results
    cout << "min = " << min << endl;
    cout << "max = " << max << endl;
    cout << "avg = " << sum/count << endl;
    return 0;
}
Now that you have your program in text form, you'll need to convert it to executable form. This is a process known as 'compiling'. Once again, in Unix we typically use separate tools for programming rather than use an IDE that takes care of everything for us. That means that we need to run the compiler for ourselves. The compiler we'll be using is g++, which is a freely available "open source" compiler. "Open source" means that you can download the source code (mostly C++ in this case) for g++ and modify it if you wish. So if you find a bug, you can fix it yourself! ;-)
Like most Unix programs, you run g++ by typing g++ on the command line. You control what it does with "options", and you list the files on which g++ is supposed to operate. To compile, you use the -c
option. Compiling a .cpp
file (i.e. source file) produces a .o
file (i.e. object file). So, to compile the file mma.cpp
we'd give the command:
g++ -c mma.cpp
The file mma.o
will be produced. This is called an 'Object' file. Compiling is a multi-stage process. The link stage links together your object file(s) and any libraries you might be using to create an executable file, which is an actual "program". The -o
name option, where name is the name you want for your executable, links together any object files you list to form a program. For example, to take the object file mma.o
we just compiled and create a program named mma
, we'd type:
g++ -o mma mma.o
The output would be an executable file, which you can run from the shell:
./mma
To make our lives a bit easier, g++ does the separate compilations and linking steps for us in a single line if we simply give it the -o
name option and list all the source files involved. For the above example, this would be:
g++ -o mma mma.cpp
Obviously, this is more convenient. In fact, g++ simply automatically breaks things up into the separate compile and link steps we explicitly listed above. So the same things happen when we use this convenient short-hand.
People typically refer to three stages in which things "happen" concerning a program: compile time, link time and run time. Understanding at which stage a thing happens can really help you understand it. This is especially true of errors.
A compile time error is something like a missing semi-colon or a type mis-match in a function call. These are things that can be determined from the information available to you in a given compilation unit.
A link time error is something that could not be determined from the information available to a single compilation unit, so it involves some sort of mismatch across two or more compilation units. This would be something like providing two different definitions matching the same prototype, or providing no definition at all for a given prototype.
A run time error is a problem that only crops up once the program is running - for example, writing beyond the end of an array, or forgetting a base case in a recursive function definition. (It seems like the compiler should be able to figure out for itself whether a base case has been forgotten, but it can actually be mathematically proven that this is impossible to always do!)
Exercises
- Compile the mma program and use it to find the min, max and average batting average for the American League in 2013. Hint: your max should be above .600 and your average should be around .200.
- Use your mma program to find the min, max and average batting average for the Boston Red Sox (BOS), then for the New York Yankees (NYY), and then for the Baltimore Orioles (BAL). Hint: Use grep cleverly before doing the cut-piped-to-grep thing from above. The Red Sox max and average should be above .600 and .200, respectively.
- Optional: If you're very ambitious, see if you can use cut and sort -n and some cut-and-paste in gedit to get rid of anyone who had fewer than 20 at bats before you compute the min, max and average for the American League. (AB means at bats.) Hint: your average should be just under .250.
Hopefully you see that by making your stats program fit the whole "streams-of-characters" model it allowed you to combine it with other Unix utilities to answer some questions pretty quickly.
Compiling programs consisting of several files is not much different than single file compilation. Suppose, for example, we have a program consisting of the following files:
main.cpp
#include <iostream>
#include "fact.h"
using namespace std;
int main() {
    cout << fact(5) << endl;
    return 0;
}
fact.h
// returns the
// factorial of n
int fact(int n);
fact.cpp
#include "fact.h"
int fact(int n) {
    if (n == 0)
        return 1;
    return n*fact(n-1);
}
To create an executable factorial
from this source, three things have to happen: fact.cpp
needs to be compiled into object code, main.cpp
needs to be compiled into object code, and the two object code files need to be linked together with standard libraries to form the executable factorial
. Here are calls to g++ that get this done:
g++ -c fact.cpp                   # creates fact.o
g++ -c main.cpp                   # creates main.o
g++ -o factorial main.o fact.o    # creates factorial
It may seem like fact.h
isn't involved, but that's not the case. Because it's included in the two .cpp
files, it actually gets processed by the compiler in both compilations. Notice that compiling main.cpp
and compiling fact.cpp
are totally separate from one another. Think about what this means: to compile main.cpp
the compiler doesn't need to know anything about fact.cpp
- it doesn't need to know anything about the function fact
, which it uses, except its prototype, which it gets from fact.h
. Similarly, g++ doesn't need to know anything about how the function fact
is going to be used in order to compile fact.cpp
.
Suppose that, starting from the previous example program, we wanted to change things so that the user enters a number, and its factorial is printed out, rather than factorial of 5 every time. We would modify main.cpp
appropriately, and recompile. But what really needs to be recompiled? Clearly fact.cpp
hasn't changed a bit, so there is no reason to recompile it. Thus, having made our change to main.cpp
, we can produce our new executable with:
g++ -c main.cpp
g++ -o factorial main.o fact.o
This idea that as changes are made to a program we only recompile the things that have changed, rather than recompiling every source file every time, is called incremental compilation. Incremental compilation is important for two reasons. First of all, it allows you to write some source code, compile it, and distribute the object files (i.e. the .o
files) rather than the source code (i.e. the .cpp
files). People use this commercially to sell libraries, which are collections of object files, while protecting their intellectual property, which is the source code. Second of all, incremental compilation allows you to work on big programs without having to wait hours after every change for compilation. To put things in perspective, a program that Dr. Brown works with for his research - a relatively small program - consists of about 1,500 source files. It takes a fair amount of time to compile all of them. Imagine making a little change, waiting half an hour for the compiler to compile over a thousand files of source code, and then having it say "whoops, you missed a semi-colon"! You'd never get anything done.
Incremental compilation can be a bit tricky once header files get involved. For example, if we change fact.h
in the above example, we really should recompile both main.cpp
and fact.cpp
, since both include it. In other words, if a header file gets modified, everything that includes it really ought to get recompiled ... just in case. And, of course, you actually have to keep track of which files have gotten modified since the last compilation. There is a Unix utility called make
that does a lot of this stuff for you. We'll talk more about it later. For the moment, your programs are so small that a complete recompilation every time you make a change is reasonable. However, you still need to understand incremental compilation to understand programming languages, because the need to be able to compile different pieces of a program separately, in isolation, affects the way languages work. Each piece that gets compiled in isolation is called a compilation unit. For example, main.cpp
(with the contents of fact.h
stuck in it) is a compilation unit. Similarly, fact.cpp
(with the contents of fact.h
stuck in it) is a compilation unit.
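The bookkeeping behind incremental compilation comes down to comparing time stamps: a source file needs recompiling if its object file is missing or older than the source. The sketch below illustrates only that idea and is not how make is actually implemented. The files are empty stand-ins with made-up time stamps, the -nt ("newer than") test is a widely supported shell extension rather than strict POSIX, and header dependencies such as fact.h are deliberately ignored.

```shell
# Scratch directory, empty stand-in files and time stamps are assumptions.
scratch=$(mktemp -d)
cd "$scratch"

touch -t 202001010000 fact.cpp main.cpp   # sources written on Jan 1
touch -t 202001020000 fact.o main.o       # objects "compiled" on Jan 2
touch -t 202001030000 main.cpp            # main.cpp edited on Jan 3

stale=""
for src in fact.cpp main.cpp; do
    obj="${src%.cpp}.o"                   # fact.cpp -> fact.o, and so on
    # Recompile if the object file is missing or older than its source.
    if [ ! -e "$obj" ] || [ "$src" -nt "$obj" ]; then
        stale="$stale $src"
    fi
done

echo "needs recompiling:$stale"
```

Only main.cpp is reported, matching the example above where fact.cpp was left alone. A real tool also has to know that a change to fact.h makes both object files stale, which is the part that gets tricky to track by hand.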
mma Program
Sometimes the "median" is a more informative statistic than the average (also called the "mean"). If you sort a list of numbers from smallest to largest, the median is the middle value in the list if there are an odd number of elements, or the average of the two middle-most elements if there are an even number. The median batting average would really give you a "typical" batting average.
Adding the ability to compute the median is problematic, because you need to store all the values (preferably in an array!) but you don't have any idea how many elements you'll get. The answer is to store them in a linked list first, then allocate an array of the proper size, then copy the elements into the array, then sort, and finally grab the median. (You could also read the file twice: once to find the size and once to input the data. You aren't allowed to do that for this lab.)
To help you realize this scheme, the following files have been provided for you: doublelist.h, doublelist.cpp, doublesort.h, and doublesort.o. Download the following file to your machine and extract into your lab01
folder:
Notice that you aren't given the .cpp file for the implementation of doublesort
. How it works is a trade secret, but you're free to link your program to the compiled object code doublesort.o
to use it. You may use these files, but you may not modify them!
Exercise: Augment the minimum, maximum, average utility so that it also prints out the median using the files provided. Call the new utility
mmam
. Remember to place the source files and the compiled executable in your lab folder.
make
It can get to be a pain in the neck to retype the compilation command for a multi-file program, even with the up-arrow key to scroll back over previous commands. There is a Unix utility called make
that makes it easier to make programs. At its simplest, make
just lets you define "targets" and commands needed to achieve those targets. Simple example: You have two programs, mma and mmam, with several different files involved. You would create the following file called "Makefile":
Makefile
vers1:
	g++ -o mma mma.cpp
vers2:
	g++ -o mmam mmam.cpp doublelist.cpp doublesort.o
The whitespace in front of each g++ command has got to be a tab!
Now, you have two "targets", vers1
and vers2
. If you type make vers1
at the command prompt, make will look up the target vers1, and execute the associated rule, in this case g++ -o mma mma.cpp
. If you type make vers2
at the command prompt, make will look up the target vers2, and execute the associated rule, in this case g++ -o mmam mmam.cpp doublelist.cpp doublesort.o
. Simple makefiles like these make it lots easier to go through that compile, debug, compile, debug, compile, debug, etc. process.
We'd really have something interesting if we could also define some command line arguments. For example, what if the user could run the program like this:
> mmam -max
and have only the maximum printed out? That'd be kind of cool, and that'd be in the spirit of these Unix utilities we've been using. C++ has a mechanism for passing such command line arguments to main
. You can define main
with zero arguments, or you can define it with two arguments: an int
and an array of c-style strings, i.e. char* []
. The two arguments passed to main
will be the number of strings the user typed when calling the program (which includes the program itself!), and an array of c-style strings that are each of the strings the user typed on the command line in calling the program. For example, the following test program reads in command line arguments:
test.cpp
#include <iostream>
using namespace std;
int main(int argc, char* argv[]) {
    cout << "argc = " << argc << endl;
    for (int i = 0; i < argc; i++) {
        cout << "argv[" << i << "] = "
             << argv[i] << endl;
    }
    return 0;
}
Here's what it looks like when compiled and executed:
Now the big thing to remember is that these are c-style strings. You can cast the elements of argv
to a C++ string
object explicitly, or you can assign them to a C++ string
object if you want C++ style strings
like we're used to. This really matters for things like comparing for equality, since
if (argv[i] == "-max") {   // <--- Very bad!
    ...
}
compares two pointers, rather than comparing strings character by character for equality. You'd be better off doing:
string s = argv[i];
if (s == "-max") {
    ...
}
Exercise: Add to mmam the functionality of recognizing the command line options -max, -min, -average and -median. Optional: If you're really ambitious, you should allow the user to specify as many options as they want and do all of them. If nothing is specified, all four should be printed out!
You only need to do the following once for your lab workstation and once for your laptop VM. (By installing this file on one lab workstation, it will be available on lab workstations for your use.)
- Create a directory named bin in your home directory. This can be done as: mkdir ~/bin
- chmod 700 ~/bin/submit
- ~/bin/submit -c=SI221 -p=Lab01 mmam.cpp
The resulting output should include "The submission may be reviewed online at" followed by a URL. Copy that URL and paste it into your browser. The resulting page should tell you how you did. If you did not get the output 100% correct, the page should give you an indication of what's wrong with your output. Keep fixing your program and resubmitting until it works perfectly.
You should now be comfortable with the basic operations of UNIX at the command line, be refreshed on your ability to generate source code, compile it, execute the program, and correct basic syntax errors.