Lecture 0 - Getting started with Python and Git

Anaconda Installation

We will use Python via the Anaconda distribution.

Download Anaconda here.

Note

There are a variety of different Python distributions; for statistics and machine learning, we recommend Anaconda as it comes with many useful packages pre-installed.

(It is also a good idea NOT to use your computer’s pre-installed Python - you don’t want to accidentally change any system settings!)

Anaconda requires a few GBs of storage - a more lightweight version is Miniconda, which you can download here.

Managing packages

There are many open source Python packages for statistics and machine learning.

To download packages, two popular package managers are pip and conda. Both pip and conda come with the Anaconda distribution. Instructions for how to install packages using conda can be found in this section.

Environments

About

It is good coding practice to use virtual environments with Python. From this blog:

A Python virtual environment consists of two essential components: the Python interpreter that the virtual environment runs on and a folder containing third-party libraries installed in the virtual environment. These virtual environments are isolated from the other virtual environments, which means any changes on dependencies installed in a virtual environment don’t affect the dependencies of the other virtual environments or the system-wide libraries. Thus, we can create multiple virtual environments with different Python versions, plus different libraries or the same libraries in different versions.

Creating an environment for MSDS-597

We recommend creating a virtual environment for your MSDS-597 coding projects. This way, you can have an environment with all the necessary packages and you can easily keep track of what versions of the packages you used.

  1. Open Terminal (macOS) or a shell
  2. Create an environment called msds597 using conda with the command: conda create --name msds597
  3. To install packages in your environment, first activate your environment: conda activate msds597
  4. Then, install the following packages using the command: conda install numpy pandas scikit-learn matplotlib seaborn jupyter ipykernel
  5. To exit your environment: conda deactivate
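One way to keep track of what the environment contains is to describe it in a file. The sketch below writes a hypothetical environment.yml (the file name, the scratch folder, and the unpinned package list are assumptions for illustration, not part of the course setup) and shows, as comments, the conda commands that would typically consume or produce such a file.

```shell
# Scratch folder for the demo; in practice this would be your project folder.
env_dir="$(mktemp -d)"

# Describe the msds597 environment in a file (illustrative, versions unpinned).
cat > "$env_dir/environment.yml" <<'EOF'
name: msds597
dependencies:
  - python
  - numpy
  - pandas
  - scikit-learn
  - matplotlib
  - seaborn
  - jupyter
  - ipykernel
EOF

cat "$env_dir/environment.yml"

# With conda installed, the file could be used like this (not run here):
#   conda env create --file environment.yml
# To record the exact versions of an existing environment:
#   conda env export --name msds597 > environment.yml
```

Keeping a file like this next to your code makes it easier to rebuild the same environment on another computer.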

Here is a helpful cheatsheet for conda environment commands.

For more details about the shell / bash, here is a helpful resource.

VSCode

There are a number of Python IDEs (integrated development environments). In class, we will be using VSCode (download here).

  1. Download lecture-1.ipynb from Canvas and open it in VSCode.
  2. To use your msds597 environment, on the top right hand corner, click “Select Kernel” > “Python Environments” > msds597. If it prompts you to install ipykernel, follow the prompts to install it.

Jupyter notebooks (.ipynb files) are useful to combine code cells with text (as markdown cells).

VSCode also has a Python interactive window (details here).

Git and GitHub

We will be using GitHub Classroom for homework submission.

Git is a Version Control System (VCS). It is a useful tool for easily tracking changes to your code, collaborating, and sharing. More specifically, Git manages the changes in a set of files – called a repository.

GitHub is a repository hosting service. It provides storage for Git-based projects on the internet. This allows collaborators to see your files, sync changes and make changes themselves.

Setting up

Git config

After installing Git, introduce yourself to Git! The following commands store your name and email in Git’s configuration settings.

On your local computer, in Terminal (or other shell), run:

git config --global user.name "Jane Doe"
git config --global user.email "jane@example.com"

substituting your name and the email associated with your GitHub account.

To see the current config settings, run git config --global --list.
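The --global flag stores a setting for your user account, so it applies to every repository on the computer. As a small sketch of the alternative, the commands below omit --global inside a throwaway repository, which stores the setting for that repository only. The scratch folder and the email address are made up for illustration.

```shell
# Throwaway repository so the demo does not touch your real projects.
scratch_repo="$(mktemp -d)"
cd "$scratch_repo"
git init -q

# Without --global, the setting is written to this repository's .git/config.
git config user.email "jane@school.example"

# Read the single setting back.
git config --get user.email

# List only the settings stored in this repository.
git config --local --list
```

A repository-level value takes precedence over the global one, which can be handy if one project should use a different email address.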

What does Git do?

Git stores snapshots of a repository over time.

The repository history is not automatically generated; Git must be explicitly told when to record a version of the project. These records are called commits. Commits contain a snapshot of the file contents as well as metadata about the repository at the time the commit was created (who made the commit, when it was made, etc.).

From Pro Git (Chacon and Straub)
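To make the idea of snapshots plus metadata concrete, here is a minimal sketch in a throwaway repository. The file name, commit messages, and identity are illustrative assumptions.

```shell
# Throwaway repository with a demo identity stored in the repository itself.
snap_repo="$(mktemp -d)"
cd "$snap_repo"
git init -q
git config user.name "Jane Doe"
git config user.email "jane@example.com"

# First snapshot.
echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "add first draft of notes"

# Second snapshot of the same file.
echo "second draft" > notes.txt
git add notes.txt
git commit -q -m "revise notes"

# Metadata: abbreviated hash, author, date, and message of each commit.
git log --format='%h | %an | %ad | %s'

# Snapshots: the file as it was one commit ago, and as it is now.
git show HEAD~1:notes.txt
git show HEAD:notes.txt
```

The last two commands print the file contents stored in each commit, which shows that earlier versions remain available after later commits.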

Local and remote repositories

Typically when we use version control, we create two copies of the repository.

  1. Local repository: this is the repository, usually on our own computer, where we create, edit and delete files.

  2. Remote repository: this repository is stored on a repository hosting service like GitHub - this is easily shareable with collaborators.

When you work on files in your local repository, your changes exist only locally. To update the remote repository, you need to explicitly push your local commits to the remote repo. We will talk about how to push commits later.

From Data Science: A first introduction with Python (Tiffany Timbers, Trevor Campbell, Melissa Lee, Joel Ostblom, Lindsey Heagy)
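The relationship between the two copies can be tried out offline. In the sketch below, a bare repository in a scratch folder stands in for GitHub (an assumption made so that the example needs no network access or credentials); the folder and file names are illustrative.

```shell
work="$(mktemp -d)"

# A bare repository on disk plays the role of the remote (GitHub).
git init -q --bare "$work/remote.git"

# The local repository, where files are created and edited.
git init -q "$work/local"
cd "$work/local"
git config user.name "Jane Doe"
git config user.email "jane@example.com"
git remote add origin "$work/remote.git"

echo "hello" > README.md
git add README.md
git commit -q -m "add README"
git branch -M main

# The commit exists only locally: the remote has no branches yet,
# so this prints nothing.
git ls-remote --heads origin

# Send the local commits on main to the remote.
git push -q origin main

# Now the remote's main branch points at the same commit as the local one.
git ls-remote --heads origin
git rev-parse main
```

Until the push, nobody looking at the remote copy could see the new commit.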

Local Repo Workflow

In our local repo, our files are in three main states: modified, staged, and committed:

  • modified means that you have changed the file but have not committed it to your history yet.
  • staged means that you have marked a modified file in its current version to go into your next commit snapshot.
  • committed means that the data is safely stored in your local history.

The basic Git workflow goes something like this:

  • You modify files.
  • You selectively stage just those changes you want to be part of your next commit, which adds only those changes to the staging area.
  • You do a commit, which takes the files as they are in the staging area and stores that snapshot permanently to your Git history.
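The three states can be observed with git status. The following is a minimal sketch in a throwaway repository; the file name and commit messages are made up for illustration.

```shell
state_repo="$(mktemp -d)"
cd "$state_repo"
git init -q
git config user.name "Jane Doe"
git config user.email "jane@example.com"

echo "draft" > report.md
git add report.md
git commit -q -m "add report"
git status --short    # no output: everything is committed

echo "more text" >> report.md
git status --short    # " M report.md": modified but not staged

git add report.md
git status --short    # "M  report.md": staged for the next commit

git commit -q -m "extend report"
git status --short    # no output again: the change is committed
```

In the short format, the first column describes the staging area and the second column describes the working folder, which is why the M moves from the second column to the first after git add.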

GitHub Authentication

When we interact with our remote repository on GitHub, we need to provide credentials for authentication. This proves we are a specific GitHub user, who’s allowed to do whatever we’re asking to do.

You cannot use the password that you use to log in to GitHub’s website. This was possible in the past, but is no longer supported (see the GitHub announcement).

Instead, Git can communicate with a remote server using one of two protocols, HTTPS or SSH, and the different protocols use different credentials. To start with, we recommend HTTPS.
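The protocol in use is visible in the address of the remote. As a sketch, the commands below set the remote of a throwaway repository first to an HTTPS address and then to an SSH address. The repository path moran-teaching/hw0-username is the placeholder used later in these notes, and nothing is actually contacted over the network.

```shell
url_repo="$(mktemp -d)"
cd "$url_repo"
git init -q

# HTTPS form: Git asks for a username and a personal access token.
git remote add origin "https://github.com/moran-teaching/hw0-username.git"
git remote -v

# SSH form: Git authenticates with an SSH key pair instead.
git remote set-url origin "git@github.com:moran-teaching/hw0-username.git"
git remote -v
```

Both addresses refer to the same repository; only the way Git authenticates changes.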

HTTPS details

With HTTPS, we use a personal access token (PAT).

To generate a PAT, go to https://github.com/settings/tokens and click “Generate token”. When creating your PAT, remember to check the “repo” checkbox, and any other desired checkboxes.

When you use an HTTPS connection and are prompted for a password, use this PAT.

More details:

SSH details

From Happy Git with R:

SSH keys provide a more secure way of logging into a server than using a password alone. While a password can eventually be cracked with a brute force attack, SSH keys are nearly impossible to decipher by brute force alone.

Generating a key pair provides you with two long strings of characters: a public and a private key. You can place the public key on any server (like GitHub!), and then unlock it by connecting to it with a client that already has the private key (your computer!). When the two match up, the system unlocks without the need for a password. You can increase security even more by protecting the private key with a passphrase.

High level overview of set up

On your local computer:

  • Create a public-private SSH key pair. Optionally, encrypt the private key with a passphrase (best practice).
  • Add the private key to your ssh-agent. If you protected it with a passphrase, you may have additional configuration.

On GitHub.com:

  • Add the public key to your GitHub account.
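As a rough sketch of the key-pair step, the commands below generate a demo key pair in a scratch folder. The folder, the file name id_ed25519_demo, and the empty passphrase are assumptions chosen so the example runs without prompting; for a real key you would normally accept the default location and set a passphrase.

```shell
key_dir="$(mktemp -d)"

# Generate an Ed25519 key pair.
#   -C attaches a label (here an example email address)
#   -f chooses where to write the key files
#   -N "" sets an empty passphrase, for this demo only
ssh-keygen -q -t ed25519 -C "jane@example.com" -f "$key_dir/id_ed25519_demo" -N ""

# Two files are created: the private key and the public key (.pub).
ls "$key_dir"

# The public key is the part that can be shared with a server.
cat "$key_dir/id_ed25519_demo.pub"

# With a real key in the default location you would then (not run here):
#   ssh-add ~/.ssh/id_ed25519     # add the private key to ssh-agent
#   ssh -T git@github.com         # test the connection to GitHub
```

The private key file should never be shared or uploaded anywhere.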

Homeworks and Git

Prof Moran has created a starter homework to become familiar with homework submission.

  1. Accept Homework-0 by clicking the following link (not assessed - this is to get familiar with Git) https://classroom.github.com/a/lhMUsxCX

  2. Associate your GitHub account with your NetID on the MSDS-597 GitHub Classroom.

  3. A new repository moran-teaching/hw0-username (with username replaced by your GitHub username) will be created.

  4. Copy this repository to a folder on your computer using the following instructions.

Note

After git clone ... using the HTTPS link, you will be prompted for a password. This is your Personal Access Token (PAT) that you created above.

  5. In your local hw0-username folder, create a new markdown file called hw0.md and add some text (e.g. “I have read the github intro.”). You can edit markdown files using VSCode, or other text editors of your choice (here are some resources for markdown in VSCode).

  6. We are now at the modified step of the git workflow. To add files to the staging step, navigate to hw0-username in Terminal. Then, run:

    • git add hw0.md or git add -A

    git add hw0.md adds only the hw0.md file while -A is an option that adds all files to the staging step.

  7. To commit these changes, run:

    • git commit -m "added hw0.md"

    The option -m stands for message - every commit needs a message (here, the message is “added hw0.md”). It is helpful to write descriptive messages so you know what each commit (snapshot) contains.

  8. Now, there is a commit (snapshot) of your changes locally. To submit your assignment, you need to push your changes to GitHub. To do this, run:

    • git push origin main

    In terms of syntax, the command is git push <remote> <branch>, where remote is the server and branch is the branch you want to update. origin is the default name Git gives to the server you cloned from. main is the main branch (we will talk about branches later).
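The difference between git add hw0.md and git add -A described in the steps above can be tried in a throwaway repository. This is only a sketch: the second file, scratch.txt, and the commit message are made up for illustration, and no push is performed.

```shell
hw_repo="$(mktemp -d)"
cd "$hw_repo"
git init -q
git config user.name "Jane Doe"
git config user.email "jane@example.com"

echo "I have read the github intro." > hw0.md
echo "scratch work" > scratch.txt

# Stage only hw0.md.
git add hw0.md
git status --short    # "A  hw0.md" is staged, "?? scratch.txt" is untracked

# Stage everything that has changed in the repository.
git add -A
git status --short    # both files are now staged

# Record the snapshot with a descriptive message, then view the history.
git commit -q -m "added hw0.md and scratch notes"
git log --oneline
```

Running git status before each commit is a quick way to check that only the files you intend to submit are staged.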

Resources

Git: