Sunday, September 6, 2009

The Journey to Git, Part X--Communicating Between Repositories

So you want to do some collaboration using Git. If you don't know where to start, you're in the right place. Start here. This post, like my earlier Git posts, will take you on a guided tour of how to collaborate with others (or yourself) using Git remoting. It will be light on both theory and the practical application of principles, focusing instead on the "how" so you can start using it as quickly as possible.

In this post, I assume you're comfortable working with a single Git repository with the basic commands like "git add", "git commit", "git branch", "git merge", and so on. If you're not to that point yet, hop back to my earlier posts in this series for a quick walkthrough:


Articles in this series:

Making a Clone

We need an existing repository to start from, so create a directory named "cloneme", change to it, and set up a repository like so:

git init
echo "foo" > foo
git add foo
git commit -m "first commit"

Simple enough: a repository with one commit and one file being tracked. Now move to the parent directory of cloneme, and run:

git clone file:///path-to-cloneme clone

Note: The "/path-to-cloneme" part should be the absolute path to the cloneme directory. It's best to go absolute here for a couple of reasons. Don't use a relative path unless you understand the implications of having a relative path stored in your .git/config file.

You've just performed your first remote Git operation by cloning an existing repository. As you might expect, you now have a complete copy of the "cloneme" project in the "clone" project. Note, however, that it's not just a copy of the working tree. It's a complete clone of the original repository. Git is, after all, a distributed VCS.

All we did in this first clone was basically a filesystem copy since we used the "file://" transport. Git, of course, supports remote operations over networks with other transports: ssh, rsync, http, https, and a native "git" transport. Each has its own, very similar, URL syntax for specifying how to find a remote repository. I use the ssh transport almost exclusively. It's secure and just as easy to use as the file transport.

At this point, you have two repositories with identical content. Running "git log" in both of them, for instance, would produce identical output. Start up gitk now, and you'll see the familiar "master" designator pointing at the head of the branch, but next to it is another thing that says "remotes/origin/master". The initial "remotes" is kind of a namespace that's set aside for specifying branches that are in remote repositories. The next piece, "origin", is the name of the remote repository, and the final one is the name of the branch in that remote repository. When you clone a repository, the cloned one automatically becomes the "origin" for the clone, making for convenient interaction with it, as we'll see in a moment.

What this gitk output is telling you is that the head of the remote repository's master branch is at the same commit as your local master branch... as far as this repo knows. Changes in a remote repository are not automatically detected by gitk, so something in the remote could've changed, but gitk won't reflect it until you "git fetch" it. Let's take a look.

Getting New Changes from the Origin Repo

Go back to cloneme, and make a new commit:

echo bar >> foo
git commit -am "second commit"

Now go back to clone. Both "git log" and gitk will show exactly the same thing as before. As I mentioned, these two commands don't do any remoting, so they have no way of knowing about the change. In order to see the new commit, you need to fetch it:

git fetch

When run with no arguments, this command will retrieve all of the latest changes from the remote repository named "origin". That's some of the convenience that I mentioned earlier. Run gitk again, but this time with "gitk --all", or you'll only see a partial picture. Now you can clearly see that the remote named "origin", which is cloneme, is one commit ahead of clone.

Note: When I say "all of the latest changes", I do mean "all". In this exercise, we're confining our work to a single branch, but "git fetch" retrieves the latest changes from all of the branches of the specified remote, as well as any new branches that have been created.

Next run:

git status

You'll see that it also quite clearly tells you that "origin" is ahead of you with a message like:

Your branch is behind 'origin/master' by 1 commit, and can be fast-forwarded.

Let's go ahead and do the mentioned "fast forward":

git merge remotes/origin/master

That should seem pretty natural to you. It's just a simple fast-forward merge, the same as you'd use to merge any branch into another. The only difference is that you're effectively merging changes from a remote branch into a local one.
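If you'd rather check the fetch-then-merge sequence from the command line than from gitk, here's a minimal sketch. It rebuilds "cloneme" and "clone" in a scratch directory; the scratch directory, the placeholder "demo" identity, and the use of "git rev-list --count" and "--ff-only" are my additions for illustration, not part of the walkthrough above.

```shell
set -eu
# Placeholder identity so commits work on a machine with no Git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)   # scratch directory (an assumption of this sketch)
cd "$work"

# Recreate "cloneme" with one commit, then clone it.
git init -q cloneme
git -C cloneme symbolic-ref HEAD refs/heads/master   # pin the branch name
echo foo > cloneme/foo
git -C cloneme add foo
git -C cloneme commit -q -m "first commit"
git clone -q "file://$work/cloneme" clone

# A new commit appears in the origin repository.
echo bar >> cloneme/foo
git -C cloneme commit -q -am "second commit"

# Fetch updates origin/master only; the local master stays where it was.
git -C clone fetch -q
behind=$(git -C clone rev-list --count master..origin/master)
echo "commits behind after fetch: $behind"

# Fast-forward master; --ff-only refuses anything that isn't a fast-forward.
git -C clone merge -q --ff-only origin/master
behind_after=$(git -C clone rev-list --count master..origin/master)
echo "commits behind after merge: $behind_after"
```

The "master..origin/master" range counts commits that are in the remote-tracking branch but not yet in your local branch, which is the same number "git status" reports.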

Local and Remote Branches

This is a good place to take a look at exactly what that "remotes/origin/master" thing is. Run:

git branch -a

You should see output like:

* master
  origin/master

The -a flag to "git branch" tells the command to display both local and remote-tracking branches, which is what remotes/origin/master--shown as "origin/master" here--is. It's a local representation of a remote branch. A remote-tracking branch exists for the sole purpose of storing commits that you fetch from remote repositories. You never commit to one or change it directly; the only thing that updates it is fetching remote changes into it.

You can, however, make a local branch that "tracks" a remote-tracking branch and make commits there. We'll get into the details of that later, but you already have one of these. The master branch of the repository in "clone" is a local branch that tracks the remotes/origin/master remote-tracking branch. It was set up this way when you did the clone. That's how Git was able to tell you that you were a commit behind the remote branch. It knows that your local branch "master" is tracking a branch named "master" in the remote named "origin".
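You can see where that tracking relationship is recorded by asking Git for the relevant configuration. This is only a sketch: it rebuilds the two repositories in a scratch directory with a placeholder identity, and the "@{upstream}" syntax it uses comes from Git versions newer than the one this post was written against.

```shell
set -eu
# Placeholder identity so commits work on a machine with no Git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)   # scratch directory (an assumption of this sketch)
cd "$work"

git init -q cloneme
git -C cloneme symbolic-ref HEAD refs/heads/master   # pin the branch name
echo foo > cloneme/foo
git -C cloneme add foo
git -C cloneme commit -q -m "first commit"
git clone -q "file://$work/cloneme" clone

# "git clone" wrote these two settings into clone/.git/config.
remote=$(git -C clone config branch.master.remote)
merge_ref=$(git -C clone config branch.master.merge)

# The same information, resolved to the remote-tracking branch name.
upstream=$(git -C clone rev-parse --abbrev-ref 'master@{upstream}')

echo "remote:   $remote"
echo "merges:   $merge_ref"
echo "upstream: $upstream"
```

Those two configuration keys are all that "tracking" amounts to: which remote to talk to, and which branch on that remote to merge from.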

The Fast Way: Pull

The fetch and merge are fine for illustrating what's happening, but generally you just want to pull the latest changes from the remote repository directly into your local branch, and the two separate commands are an unnecessary step. Enter "git pull". This command is nothing but a combination of "git fetch" and "git merge". It's even clever enough to figure out what you want it to do without any arguments if you're on a branch that's tracking a remote-tracking branch, like your master branch in "clone". Go make another commit in "cloneme":

echo baz >> foo
git commit -am "third commit"

Now switch to "clone" and simply run:

git pull

Everything happens automatically, and the "master" branch of "clone" now has the new commit in it. As I mentioned, you didn't have to tell "git pull" which branch to merge from because the current branch, "master", tracks "remotes/origin/master", so that's the one it selects for the merge.

Note: Unlike "git fetch", "git pull" doesn't pull all changes from all branches into the matching local branches. Since part of a pull is a fetch, it does fetch all of the changes into the remote-tracking branches, but only the current local branch is updated with changes from its respective remote-tracking branch. That is, only one merge is performed upon a pull.
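To see that only one merge happens, here's a rough sketch with a second branch. The branch name "topic" is hypothetical, as are the scratch directory and the placeholder identity; none of them appear in the walkthrough itself.

```shell
set -eu
# Placeholder identity so commits work on a machine with no Git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)   # scratch directory (an assumption of this sketch)
cd "$work"

# Origin repository with two branches: master and a hypothetical "topic".
git init -q cloneme
git -C cloneme symbolic-ref HEAD refs/heads/master   # pin the branch name
echo foo > cloneme/foo
git -C cloneme add foo
git -C cloneme commit -q -m "first commit"
git -C cloneme branch topic

# Clone it and create a local "topic" that tracks origin/topic.
git clone -q "file://$work/cloneme" clone
git -C clone branch --track topic origin/topic

# New commits land on both branches in the origin repository.
echo bar >> cloneme/foo
git -C cloneme commit -q -am "second commit on master"
git -C cloneme checkout -q topic
echo topic >> cloneme/foo
git -C cloneme commit -q -am "commit on topic"
git -C cloneme checkout -q master

# Pull while on master: everything is fetched, only master is merged.
git -C clone pull -q
master_behind=$(git -C clone rev-list --count master..origin/master)
topic_behind=$(git -C clone rev-list --count topic..origin/topic)
echo "master is behind by $master_behind, topic is behind by $topic_behind"
```

The local "topic" branch stays one commit behind until you check it out and pull or merge there.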

Everything so far has just been in one direction: from the original repository to the clone. Eventually, you'll want to go back in the other direction. Make a fourth commit in the "clone" project:

echo clone >> foo
git commit -am "commit in clone"

Now switch to the "cloneme" project. When you clone a repository, the cloned one doesn't gain any knowledge of the clone, so it should be no surprise that running a simple "git pull" from "cloneme" will get you an error like:

fatal: 'origin': unable to chdir or not a git archive
fatal: The remote end hung up unexpectedly

Detour: Configuring a New Remote Repository

Remember that "git pull" tries to fetch changes from "origin" if you don't tell it something different. Because this repository wasn't cloned from anything, it doesn't have an "origin". We'll need to tell it where it can get changes from by adding a remote repository:

git remote add theclone file:///path-to-clone

Note: Again, /path-to-clone should be the absolute path to the "clone" project.

This adds a remote named "theclone" to this repository's configuration.

Pull Continued

With the newly configured remote, pulling changes is as simple as:

git pull theclone master

Why the extra arguments? Well, first, we have to specify the name of the remote, since the default is "origin". We could have named our remote "origin", but that's not really what it is, so I picked something else. As for the "master" part, since our current branch--the local branch "master"--isn't set up to track any remote-tracking branches, "git pull" doesn't have any information about which remote branch to merge changes from. Therefore, we explicitly state which branch we want to use. The changes are pulled into the current branch.
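If you get tired of typing the extra arguments, you can tell Git which remote-tracking branch your local branch should track, after which a bare "git pull" works here too. This is a sketch under a few assumptions: "--set-upstream-to" comes from later Git releases than the one used in this post, and the scratch directory and placeholder identity are mine.

```shell
set -eu
# Placeholder identity so commits work on a machine with no Git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)   # scratch directory (an assumption of this sketch)
cd "$work"

git init -q cloneme
git -C cloneme symbolic-ref HEAD refs/heads/master   # pin the branch name
echo foo > cloneme/foo
git -C cloneme add foo
git -C cloneme commit -q -m "first commit"
git clone -q "file://$work/cloneme" clone

# A commit made in the clone that cloneme doesn't know about yet.
echo clone >> clone/foo
git -C clone commit -q -am "commit in clone"

# Add the remote, fetch once so theclone/master exists, then track it.
git -C cloneme remote add theclone "file://$work/clone"
git -C cloneme fetch -q theclone
git -C cloneme branch --set-upstream-to=theclone/master master

# No arguments needed now: the remote and branch come from the config.
git -C cloneme pull -q

cloneme_tip=$(git -C cloneme rev-parse master)
clone_tip=$(git -C clone rev-parse master)
echo "cloneme: $cloneme_tip"
echo "clone:   $clone_tip"
```

The fetch has to come first because the remote-tracking branch "theclone/master" doesn't exist until something has been fetched from the new remote.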

You now know how to clone repositories, add remotes, and pull changes. That's about all you need to know to start using Git to collaborate on projects; however, there's one more thing that Git lets you do: push. Because of what it does, it's somewhat more difficult to use correctly. There are some caveats, which I'll mention as we go along.

Pushing Changes Instead of Pulling

When would you need to push changes out instead of pulling them in? Well, it's great that Git is distributed and that everyone has their own complete repository for working in, but if you were working on a project team of even moderate size, you can imagine how difficult it would be to say what the "current" state of the project is if everybody just has their own repos and swaps changes at will. You would want to create what Git terms a "blessed" repository. That's a repository where finished work gets pushed to and where you pull from to get the latest "official" state of the project.

Warning--Angels Fear This

Before we go on, let me clearly state that the Git FAQ says you should only push to a bare repository "until you know what you are doing". A bare repository is one that was created with the --bare option. It has no working tree. The FAQ gives this advice because pushing into a branch that is checked out to a working tree can be problematic. That's what we're going to do here, though, because properly managed, it's not an issue, and I find it very useful for syncing changes between two different computers that I'm working on. Just realize that the issues we'll encounter related to working tree state don't arise when you follow the FAQ's advice of pushing only to bare repos.
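For comparison, here's roughly what the FAQ-approved setup looks like: a bare "blessed" repository that you push to. The name "blessed.git", the remote name "blessed", the scratch directory, and the placeholder identity are all hypothetical choices for this sketch.

```shell
set -eu
# Placeholder identity so commits work on a machine with no Git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)   # scratch directory (an assumption of this sketch)
cd "$work"

git init -q cloneme
git -C cloneme symbolic-ref HEAD refs/heads/master   # pin the branch name
echo foo > cloneme/foo
git -C cloneme add foo
git -C cloneme commit -q -m "first commit"

# A bare clone has the repository data but no working tree.
git clone -q --bare "file://$work/cloneme" blessed.git
is_bare=$(git -C blessed.git rev-parse --is-bare-repository)
echo "blessed.git is bare: $is_bare"

# Push new work to it; there is no working tree to fall out of date.
git -C cloneme remote add blessed "file://$work/blessed.git"
echo pushme >> cloneme/foo
git -C cloneme commit -q -am "a commit to be pushed"
git -C cloneme push -q blessed master

local_tip=$(git -C cloneme rev-parse master)
blessed_tip=$(git -C blessed.git rev-parse master)
echo "local:   $local_tip"
echo "blessed: $blessed_tip"
```

Because nothing is ever checked out in "blessed.git", none of the index and working tree cleanup described below is needed on that side.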

The Simple Push

At this point, your two repositories, "cloneme" and "clone", should be in sync. That is, they both have the same set of four commits in them. A "git pull" from either side will end with an "Already up-to-date", and neither has any uncommitted changes. Let's add a new commit to "cloneme" and push it to "clone":

echo pushme >> foo
git commit -am "a commit to be pushed"
git push theclone

The first thing to note is that we didn't specify a branch name, only the name of the remote. When you do that, changes in all local branches are pushed to the remote if a branch with the same name already exists there. In other words, if we were to create a new branch named "mybranch" in project "cloneme" and run "git push theclone" again, no changes would be made because that branch doesn't exist in "clone". If you want to send the new branch across, you could do it by specifying the branch name like "git push theclone mybranch".

Why Push Isn't So Simple

Let's go see what "clone" looks like now. You might be a bit surprised at the result. A "git log" will show you that the latest commit was pushed successfully. However, "git status" shows that you have changes in your index. How did this happen? It was clean before the push. Well, run a "git diff --staged" to see what it says has changed. You should see something like this:

diff --git a/foo b/foo
index 5a347e2..90c3f45 100644
--- a/foo
+++ b/foo
@@ -2,4 +2,3 @@ foo
 bar
 baz
 clone
-pushme

It's saying that in project "clone", you've removed the line that you just added in "cloneme". Why? Because "git push" does not make any changes to the working tree or index of a remote repository, lest work be lost. Particularly when you push to a remote that's not in your control, you have no way of knowing whether somebody else is making changes to the working tree or index at the same time, and you can imagine the havoc if "git push" were to mess with those changes. So while the new commit was added to the repo, neither the working tree nor the index has been touched, and both are in the same state as they were when the HEAD^ commit was the latest. Therefore "git diff --staged" shows exactly that: the output you would expect from running "git diff HEAD HEAD^" in either of the repositories.

To correct this, since you know that no work will be lost, simply run:

git reset --hard

Now your working tree and index properly reflect the tip of the branch, where you want them to be.
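One thing to be aware of if you try this on a newer Git: since version 1.7.0, the receiving repository refuses a push into its checked-out branch by default, so the push above fails unless the receiving side opts in. The sketch below sets "receive.denyCurrentBranch" to "ignore" in "clone" to reproduce the behavior described here. The scratch directory and placeholder identity are my assumptions.

```shell
set -eu
# Placeholder identity so commits work on a machine with no Git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)   # scratch directory (an assumption of this sketch)
cd "$work"

git init -q cloneme
git -C cloneme symbolic-ref HEAD refs/heads/master   # pin the branch name
echo foo > cloneme/foo
git -C cloneme add foo
git -C cloneme commit -q -m "first commit"
git clone -q "file://$work/cloneme" clone
git -C cloneme remote add theclone "file://$work/clone"

# Let "clone" accept pushes into its checked-out branch.
git -C clone config receive.denyCurrentBranch ignore

echo pushme >> cloneme/foo
git -C cloneme commit -q -am "a commit to be pushed"
git -C cloneme push -q theclone master

# The commit arrived, but the index in "clone" still matches the old tip.
if git -C clone diff --staged --quiet; then
  staged_before=clean
else
  staged_before=dirty
fi

# Bring the index and working tree up to the new tip.
git -C clone reset -q --hard
if git -C clone diff --staged --quiet; then
  staged_after=clean
else
  staged_after=dirty
fi
echo "index before reset: $staged_before, after reset: $staged_after"
```

With "--quiet", "git diff" prints nothing and reports through its exit status whether there were differences, which makes it handy in scripts.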

Another Restriction on Push

There's one more caveat about "git push": by default, it will only succeed if you can fast-forward the remote branch(es) you're pushing to. Put another way, if you're pushing from "cloneme" master to "clone" master, then the set of commits in "cloneme" must be a superset of the ones in "clone", or the push can't succeed. Again, it's a question of overwriting someone else's work. The most likely way for this to happen is if you're trying to push changes to a remote branch that you previously pulled from, but someone else has added new commits to it in the meantime. The solution in that case is to do another "git pull" to get the latest changes, and then you'll be able to push because you'll have the required superset of commits.

Of course, you can force Git to do a non-fast-forward push. Just make sure you understand that this will destroy work that's been done! Let's look at an example. In project "clone", make a new commit:

echo loseme >> foo
git commit -am "this commit will be lost by a bad push"

Now go back to "cloneme" and run:

echo destroyer >> foo
git commit -am "this commit will cause the loss of a commit in clone"

First, try a typical push:

git push theclone

It will result in an error like:

 ! [rejected]        master -> master (non-fast forward)
error: failed to push some refs to 'file:///cygdrive/c/dev/projects/clone'

Now force it to do the push with:

git push theclone +master

The '+' indicates that Git should force the push. Go over to "clone" now, and a "git log" will show you that the last commit we made there has disappeared. Because we're pushing to a non-bare repository, the index will still have the lost change in it, but another "git reset --hard" will bring it up to date with the repo.
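Here's the rejected push and the forced push as one script, in case you want to watch the commit disappear without touching your real repositories. As before, the scratch directory, the placeholder identity, and the "receive.denyCurrentBranch" setting (needed on newer Git versions to push into a checked-out branch) are assumptions of this sketch.

```shell
set -eu
# Placeholder identity so commits work on a machine with no Git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)   # scratch directory (an assumption of this sketch)
cd "$work"

git init -q cloneme
git -C cloneme symbolic-ref HEAD refs/heads/master   # pin the branch name
echo foo > cloneme/foo
git -C cloneme add foo
git -C cloneme commit -q -m "first commit"
git clone -q "file://$work/cloneme" clone
git -C cloneme remote add theclone "file://$work/clone"
git -C clone config receive.denyCurrentBranch ignore

# The two repositories diverge: one new commit on each side.
echo loseme >> clone/foo
git -C clone commit -q -am "this commit will be lost by a bad push"
lost=$(git -C clone rev-parse master)

echo destroyer >> cloneme/foo
git -C cloneme commit -q -am "this commit will cause the loss of a commit in clone"

# A plain push is refused because it isn't a fast-forward.
if git -C cloneme push -q theclone master; then
  plain=accepted
else
  plain=rejected
fi
echo "plain push: $plain"

# The leading '+' forces the update for this one branch.
git -C cloneme push -q theclone +master

clone_tip=$(git -C clone rev-parse master)
cloneme_tip=$(git -C cloneme rev-parse master)
echo "clone master now at $clone_tip (was $lost)"
```

Writing "git push --force theclone master" has the same effect as the '+' prefix; the prefix just lets you force one branch while pushing others normally.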


And that, as they say, is that. Journey complete. If you've read and followed along with all of my Git posts, you may be an incurable geek, and you certainly should know enough to be dangerous with Git and to start seeing how great it is in comparison with a centralized VCS. Aside from a quick command reference, which is almost finished, this is all I plan to post about Git for the time being (finally!!! woohoo!!!). If you have any questions, feel free to drop me a comment, and I'll answer it to the best of my ability.

Late addition: I've published a Git reference card on Scribd that should be good for reminding you of the commands you need to use without having to dig back through these posts.

Friday, August 21, 2009

Book Review: xUnit Test Patterns + Code Hangover

This is not a book review.

This is a book review.

Over the past few weeks, I read another book: xUnit Test Patterns. I posted the review on a different blog: codehangover.com. It's a new blog that I'm coauthoring with some former coworkers of mine. I haven't decided exactly what the division of labor between this and that blog will be, but I intend to put the more formal ones, like book reviews and my Git series (I'll finish it soon!), over there.

Some of the other authors on codehangover.com had their own technical blogs, and we decided to combine efforts to hopefully make a more useful blog and, honestly, one that will draw more traffic and maybe earn us all a bit more from affiliate sales ;)

Friday, August 7, 2009

The Journey to Git, Part IX--Communicating from Git to Subversion

In this second part of the Git/Subversion interaction guide, we'll explore the commands that let you do the equivalent of "svn update" and "svn commit". You need to already have a Git repository that's linked to a Subversion repository. The previous post in this series will help you with that if you need it.


Getting a Git copy of a Subversion repository and making local commits/branches/whatever in it is great. How do you get further updates of commits that others have made to Subversion? What do you do when you're ready to send your changes back to Subversion? First, decide which branch it is you want to send or receive commits for, and make sure it's checked out. Both the commands I'm going to discuss work within your current branch.

Before committing anything to Subversion, it's never a bad idea to update first and see if there have been any changes, so let's look at that command first.

Updating from Subversion with Git

The Rebase

It may not be apparent at first, but when we pull changes from SVN to Git, what we really want is a rebase. Why? To maintain a perfectly linear history for the sake of Subversion. We've seen the git rebase command before. Recall that it's the one that lets you freely squash, edit, delete, and move commits around in a branch. Every Git branch can be traced back to the Subversion branch from which it originated, and that's the one it pulls commits from. To pull the latest changes from Subversion into your current branch, you first need a clean index and working tree, and then run:

git svn rebase

Note: See the note at the beginning of my previous post to learn how "git svn rebase" is currently broken in Cygwin and to find a workaround.

Your index and working tree have to be clean because of what the rebase does, which I'll get to in a minute. First, let's examine in more depth why we have to use a rebase. When you're working out of a Subversion repository with Git, you'll always be building on top of Subversion commits. Say you're on a branch where there are two Subversion commits: A and B. Then you make commit C in Git locally. Meanwhile, someone else has made a new commit to Subversion--call it X. When you go to pull that change from Subversion down to your local Git repo, where should it go? Your first inclination might be that your local history should become A -> B -> C -> X, but this is 100% wrong. Subversion already has A -> B -> X, and you're not easily going to convince it to put C in front of X. Git, on the other hand, has no problem at all sticking the X before the C. That's exactly what "git rebase" is for. Therefore, when you pull commit X from Subversion, you want to end up with A -> B -> X -> C. That is, you want the Subversion commits to form one unbroken sequence, with all of your local commits coming after them.
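You can watch that reordering happen with plain Git, no Subversion required. In this sketch a branch I've called "upstream" stands in for the Subversion history, and an ordinary "git rebase" stands in for "git svn rebase"; it's an analogy for the ordering only, not a demonstration of git-svn itself. The repository name, scratch directory, and placeholder identity are hypothetical.

```shell
set -eu
# Placeholder identity so commits work on a machine with no Git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)   # scratch directory (an assumption of this sketch)
cd "$work"
git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/master   # pin the branch name

# Commits A and B: the shared history.
echo A > a; git add a; git commit -q -m A
echo B > b; git add b; git commit -q -m B

# "upstream" plays the role of the Subversion branch.
git branch upstream

# Commit C is made locally...
echo C > c; git add c; git commit -q -m C

# ...while commit X shows up upstream.
git checkout -q upstream
echo X > x; git add x; git commit -q -m X
git checkout -q master

# Rebase: set C aside, move to X, then replay C on top of it.
git rebase -q upstream

# List commit subjects from oldest to newest on one line.
order=$(git log --reverse --format=%s | xargs)
echo "history: $order"
```

Each commit touches its own file here, so the replay can't conflict. The next section covers what happens when it does.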

Now, a "git svn rebase" behaves much like a normal "git rebase". First, it moves all of your local commits--the ones that Subversion doesn't know about yet--out of the way, effectively taking them out of the branch and making HEAD point at the latest SVN commit that's in your Git repo. Then it pulls down all of the SVN commits that aren't represented locally and applies them one at a time on the HEAD of the current branch. This part should all go smoothly because it's basically just copying history from Subversion into Git. After that's done, the rebase puts your commits back on to the HEAD of the current branch, again one at a time.

When Conflict Occurs

When your commits are being reapplied, it's quite possible that you'll experience conflicts with the new commits you got from Subversion, just as you would with "svn update", and this is the main reason that your working tree and index must be clean before you start any kind of rebase. In the event of a conflict, you use your working tree and index to resolve it.

It's very important that you pay attention to Git's messages during any kind of rebase because if there is a conflict, it becomes an interactive process. The rebase will stop, and you'll be looking at a dirty working tree with unmerged files and possibly some staged changes in the index. All the changes you see are the content of the commit Git was trying to apply when the conflict happened. It's waiting for you to resolve the conflict somehow and then tell it to continue the rebase with:

git rebase --continue

Conflict resolution works exactly as I described in my post on merging. Note that we continue the rebase with the "git rebase" command and not "git svn rebase". Once the rebase is kicked off, it acts just like any other rebase you'd perform in Git, and as with any other rebase, resolving the conflicts and continuing is only one of your three options. You can also skip the current commit with:

git rebase --skip

You'd use this if, for instance, the current commit is no longer applicable because of an upstream commit that you've received. The commit is effectively deleted, and it won't be in the branch when the rebase is complete. The third option is to abort the rebase entirely with:

git rebase --abort

You can always abort up until the rebase is completely finished. An abort takes you back to the state you had before you started the rebase. If you really get yourself in a bind, or if you decide you just don't have the knowledge to resolve the conflicts effectively, you can always abort and come back to it later.
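Here's a small sketch of two of those options, abort and continue, using a plain "git rebase" against a stand-in branch I've called "upstream". The conflict is manufactured by changing the same line on both sides. The names, the scratch directory, the placeholder identity, and the "GIT_EDITOR=true" trick (which keeps newer Git versions from opening an editor) are all assumptions of the sketch.

```shell
set -eu
# Placeholder identity so commits work on a machine with no Git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)   # scratch directory (an assumption of this sketch)
cd "$work"
git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/master   # pin the branch name

echo base > foo; git add foo; git commit -q -m base
git branch upstream

# Both sides change the same line, guaranteeing a conflict.
echo local > foo; git commit -q -am local
git checkout -q upstream
echo remote > foo; git commit -q -am upstream
git checkout -q master

before=$(git rev-parse HEAD)

# First attempt: the rebase stops on the conflict, and we back out.
if git rebase upstream >/dev/null 2>&1; then
  result=clean
else
  result=conflict
fi
git rebase --abort
after_abort=$(git rev-parse HEAD)

# Second attempt: resolve the conflict by hand and carry on.
git rebase upstream >/dev/null 2>&1 || true
echo resolved > foo
git add foo
GIT_EDITOR=true git rebase --continue >/dev/null

order=$(git log --reverse --format=%s | xargs)
echo "first attempt: $result, final history: $order"
```

After the abort, HEAD is back at exactly the commit it started on, which is why aborting is always a safe way out.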

The really important thing about rebasing is that when it stops in the middle, you must see it through to the end one way or another. Don't go off to work on something else until the rebase is complete, or you're really going to confuse yourself.

Now that we've got all the incoming changes, let's see how to send changes back to Subversion.

Committing to Subversion with Git

Compared to what we've seen so far about Git-SVN interaction, sending your local commits back to SVN is a breeze. Just make sure you have a clean working tree and index, then run:

git svn dcommit

There's not much that can go wrong with this command. The working tree and index have to be clean because a dcommit ends with a rebase or a reset, and we've already seen why you have to clean those up before a rebase. I'm not certain exactly what the rebase/reset does, but I think it has to do with putting the "SVN version" of the commit in your branch in place of your local commit.

When the dcommit is complete, there will be a commit in Subversion matching each of the local commits you had in Git, and the commits in Git will all reflect a git-svn-id, indicating that they're recorded in Subversion.

That's it. Two commands are all you need to swap commits with a Subversion repository. Next up, I'll talk about interacting with remote Git repositories. You'll find that the basics are quite similar to the Subversion interaction, but it's a lot more powerful.

The Journey to Git, Part VIII--Connecting Git to Subversion

In this post, we'll start seeing how to use Git as a client to a Subversion repository. This is an excellent way to get your feet wet with Git without forcing the learning curve on others working on the same project. It might also be a useful intermediate step in moving from SVN to Git by getting all the members of a team accustomed to Git while still having their old SVN client as backup in case they get lost. As has happened previously, when I got to the end of what I thought was one post, I decided it was way too long, so I'm breaking it up into two pieces. This piece discusses cloning an existing Subversion repo and what you'll have after you do that. The next one explains the commands you use to trade commits with Subversion: the equivalents of "svn commit" and "svn update".

Before you dig in here, you should be able to use basic Git commands like commit, checkout, and branch. If you're not comfortable with that, have a look at my earlier posts:


Setting Up for Subversion Interaction

All Subversion interaction is done through a set of special sub-commands that start with "git svn". If you're running Git through Cygwin, there are two things to note. (Non-Cygwin-users can skip to the next paragraph.) First, you need to install the "subversion-perl" package in Cygwin to be able to use the "git svn" set of commands. Second, a change introduced to Cygwin a few months ago slightly broke "git svn" under Cygwin. See this message for a description of the problem and a workaround for it. When I refer to the "git svn rebase" command in the next post, you'll need to use the mentioned workaround in its place. It may also affect the "git clone" command, but I've neither checked it for myself nor seen any reports on it.

Only Cygwin Git users need to do anything special. On Linux and in msysGit, everything is already in place. I expect the primary way that Java developers will start using Git is by cloning an existing SVN repo, and that's what I'm going to go through here.

Cloning an Existing Subversion Repo

To use Git to work against an existing SVN repository, your first step is to clone it. Remember that Git is a Distributed VCS, meaning you have your own copy of the entire repository. Cloning SVN is one way to get one.

Note: This can take several hours on a large project with a moderate number of branches because of the way SVN stores branches and the sheer number of files that have to be transferred!

If you're ready to start the clone, get the URL of your SVN repo, and switch to the directory in which you want your project directory to live. The "git svn clone" command creates a subdirectory and checks out the project in it. For a Subversion repo using the standard directory layout--that is, directories named trunk, branches, and tags--run:

git svn clone --stdlayout --username=<your username> <svn url> foo

If your repository structure differs from the standard layout, use this form instead:

git svn clone --trunk=<trunk dir> --tags=<tags dir> --branches=<branches dir> --username=<your username> <svn url> foo

Username is only required if using authentication, obviously, and "foo" is the name of the directory to create to hold the project. If you don't provide this, then the last bit of the URL--after the final '/'--will be used as the directory name.

Working with a Git Clone of a Subversion Repository

After a successful clone, the target directory will have a typical Git repository and working tree in it. Your standard master branch will be there, and master's HEAD is what's checked out. In my experience, the content of master will be the Subversion branch with the most recent commit on it, but I haven't seen this behavior documented anywhere. You can see all of your Subversion branches and tags by running:

git branch -r

Note: I haven't covered remote Git interaction yet, so this may stray a bit into unfamiliar territory. Just understand that master is your local branch, where you do your work. If you were to "git commit" something, this is where that commit would go. All the Subversion branches you just saw are called "remote-tracking branches". For the most part, you can pretend they're not there. They just act kind of like a mirror of the Subversion repo so that you always have a copy of it around. The usefulness of this will become apparent later. Finally, master "tracks" one of the remote Subversion branches, meaning initially it contains exactly the same commits as that branch, and when you commit to or update from Subversion, that's the branch you'll be interacting with.

So now you have a Git copy of your SVN repo. What next? Well, now you can develop away using Git just like you always would: make commits, branch, merge, rebase, etc. There's just one caveat: don't fool with the history that came from Subversion. Don't try to rebase and change SVN commits around, for example. Just treat the commits from SVN as read-only. Immutable. Untouchable. Get the idea? If you screw with SVN's tiny brain in that way, don't come back to me unless it's only to describe how your SVN or Git repo melted down! I'd be interested to hear about that. You'll know the SVN commits because when you "git log", you'll see a special "git-svn-id" line in each SVN commit. Other than that, it's open season for making changes.

Handling Different Subversion Branches from Git

Of course, there's just the one local branch--master--and it's tracking just the one Subversion branch. That means any changes you make on master will always be sent back to that same Subversion branch. How do we send changes to a different branch? Just create a new local branch that tracks the remote-tracking branch you want to interact with. To create and switch to it with one command, run:

git checkout -b <new branch name> --track <remote branch name>

Now all commits made in your Git branch <new branch name> will eventually be sent to the Subversion branch <remote branch name>, and when you update from Subversion on that local branch, you'll get changes from that remote branch. Remember that you can use "git branch -r" to see all the remote branches that Git knows about. If you don't see the branch you want, then you'll need to use this command to refresh your remote-tracking branches with the latest Subversion changes, including new branches:

git svn fetch

Note: If there's a new branch to fetch, it can take a while, though not as long as the initial clone.

The one thing to keep in mind when branching in Git is to make sure to keep the history linear from Subversion's perspective. Don't branch from master and try to commit both the branch and master back to Subversion. Either merge them together, or commit one back, update the other to get it current, and then commit it, too. Again, I'm only interested in hearing about the details of the meltdown.

Those are the high points of cloning a Subversion repo and working in the clone. The next step is to be able to send commits back to Subversion and update the clone that you've made when someone else commits.

Monday, July 27, 2009

The Journey to Git, Part VII--Other Useful Stuff

My previous Git posts were mostly a walkthrough of the basic workflow to get you up and running with Git fast. This post is less that and more a quick survey of other commands that are regularly used and/or useful. Previous posts aren't a prerequisite for this, but you need to at least have a repository with a few commits and branches in it to be able to run the commands and see what they do.


See What Changed

One of the most frequently used commands is one found in every version control system:

git diff

This command, by default, simply shows you what is different in your working tree from your index. In other words, it shows you what you've changed since the last commit but haven't staged yet. To see changes you've staged for commit, use:

git diff --staged

Of course, you can also use it to view the changes between any two arbitrary commits and/or branches:

git diff <commit|branch> <other commit|branch>

Note: Unless you want to see history in reverse, you always put the older commit first and the newer commit second.

And finally, you can see just the changes to a particular file or set of files by listing their names after the command and any options:

git diff file1 file2 ...

When you're using commands like this that refer to commits, it quickly gets old to look up their hashes, even when you can just copy/paste them. Fortunately, Git provides a concise vocabulary for specifying commits without using hashes. First, "HEAD" always refers to the "tip", or latest commit, of the current branch. You can also typically use the branch name to refer to the same commit, so we have at least one commit in each branch we can always refer to without knowing its hash. After that, when you know the name of any commit, you can use a caret to say "previous", so "HEAD^" means the commit before the latest commit on the current branch. Likewise, "master^" would refer to the commit before the latest commit on branch master. Carets stack, and each additional one signifies one more commit backward: "HEAD^^^^" is four commits before the latest commit on the current branch. This can also be expressed with "HEAD~4". Just use the tilde and a number to go back a specified number of commits. This is just the proverbial tip of the iceberg on specifying commits, but it's likely all you'll need for a great majority of what you'll do.
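If you want to convince yourself that the caret and tilde forms really land on the same commit, here's a minimal sketch. It assumes a throwaway repository in a temp directory; the file name, commit messages, and identity settings are made up for the example.

```shell
set -e
# Throwaway repository; every name below is just for illustration.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

# Five commits, so HEAD has at least four ancestors.
for i in 1 2 3 4 5; do
    echo "line $i" >> foo
    git add foo
    git commit -q -m "commit $i"
done

# Resolve both spellings to a hash; they should be identical.
carets=$(git rev-parse 'HEAD^^^^')
tilde=$(git rev-parse 'HEAD~4')
echo "HEAD^^^^ -> $carets"
echo "HEAD~4   -> $tilde"

# Four commits back from the fifth commit is the first one.
git log -1 --format=%s 'HEAD~4'
```

The "git rev-parse" command just turns a name into a hash, which makes it handy for checking what a name like "master^" actually points at before you feed it to something else.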

Given this new way of specifying commits, a command I use quite regularly is:

git diff HEAD^ HEAD

That is: show me what I did in the last commit. One final note: there are many diff viewing GUIs out there, but I'm not going to go into that much right now. If you've made it this far, you can probably manage setting one of them up on your own. I'll just point you at:

git help config

Search the output for "diff.external", and go from there. If you need more help, drop me a comment, and I'll see what I can do.

See History

The Command Line Way

Another often-used command that's common to VCSs is the log command:

git log

We've used this command in previous posts, but I'm going to add a few variations and a bit of detail to your toolbox here.

You might be accustomed to a log command that shows all the commits made on the current branch. Git does things slightly differently. The "git log" command shows all commits contained within the current branch. It's a subtle difference. When you merge a branch into another, not only does the merge commit show up in the destination branch, the commits from the branch that was merged appear as well. That's because all those commits are part of the state of that branch now. This can be slightly confusing to look at sometimes, but there's a handy option that helps you sort out where each commit came from when you need to:

git log --graph

The --graph option represents branches as lines to the left of the commits being shown. Each commit will have an asterisk next to it in one of the lines indicating which branch the commit was actually made on.

Note: The --graph option also changes the ordering scheme of the commits, potentially causing them to not appear in chronological order. I assume this is meant to make the graph easier to read, but I find it distracting. Use the --date-order option to put them back in chronological order.
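To see the "contained within the current branch" behavior for yourself, here's a rough sketch, assuming a scratch repository with a branch I've arbitrarily named "topic". After the merge, the log on master includes the commit that was made on topic.

```shell
set -e
# Scratch repository; branch and file names are made up for the example.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is "master"
git config user.name "Example"
git config user.email "example@example.com"

echo base > foo
git add foo
git commit -q -m "base on master"

# One commit on a side branch...
git checkout -q -b topic
echo topic > bar
git add bar
git commit -q -m "work on topic"

# ...and one on master, so the merge can't be a fast-forward.
git checkout -q master
echo more >> foo
git commit -q -am "work on master"
git merge -q -m "merge topic" topic

# The log on master now contains the topic commit plus the merge commit.
git log --graph --oneline
count=$(git rev-list --count HEAD)
echo "commits reachable from master: $count"
```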

Another useful option lets you search commit messages and show only commits that match the search pattern:

git log --grep="some text"

Finally, sometimes it's handy to just see commits from certain times:

git log --since=yesterday
git log --until="15 Jul"

Note that none of these options are mutually exclusive. This is perfectly valid:

git log --graph --since="last week" --until="two days ago"

Note: I don't know what handles the date parsing in Git. I've seen it in the docs somewhere, but I can't figure out where. Whatever it is, it's very versatile.
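Relative dates like "yesterday" are awkward to demonstrate because the result depends on when you run them, so this sketch pins the commit dates with the GIT_AUTHOR_DATE and GIT_COMMITTER_DATE environment variables and filters with absolute dates instead. The commit_on helper and the commit messages are hypothetical, just there to keep the example short.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

# Hypothetical helper: make an empty commit with a fixed date and message.
commit_on() {
    GIT_AUTHOR_DATE="$1" GIT_COMMITTER_DATE="$1" \
        git commit -q --allow-empty -m "$2"
}
commit_on "2009-07-01T12:00:00" "Add parser"
commit_on "2009-07-10T12:00:00" "Fix parser bug"
commit_on "2009-07-20T12:00:00" "Add docs"

# Only commits whose message matches the pattern.
git log --format=%s --grep="parser"

# Only commits inside the date window.
window=$(git log --format=%s --since="2009-07-05" --until="2009-07-15")
echo "$window"
```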

The Fancy Way

Well, "git log" is great and all that, but what about when you really need to get down in the dirt and pick through the complete history of files? That's what gitk is for. It's similar to "git log" but packs much more detail into a screen. Gitk is one of the GUIs that comes with Git. In Cygwin and msysgit, it's installed along with the Git package, but on Linux, it's a separate package named--wait for it--"gitk". In any of them, the name of the command is also "gitk".

Run "gitk", and take a look around the interface. The top part of the screen is your list of commits, just like in "git log". Click a commit to select it. The bottom part is a diff showing what was changed in the currently-selected commit. In between the two is a control panel that, among other things, shows the SHA-1 of the selected commit, lets you search the selected commit for text, and lets you run rather powerful searches of all the commits appearing in the top pane.

In addition to all of that, "gitk" accepts many options that "git log" does, including some that lend themselves extremely well to the graphical representation. For one, you can see all commits from all branches with:

gitk --all

Another great feature is the ability to view any uncommon history between two branches you're thinking about merging:

gitk branch1...branch2

Run that way, it will show all commits from the latest commit on each of the branches back to the nearest common commit between the two: i.e. all the commits you're about to merge together.
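The three-dot syntax isn't specific to gitk. As a sketch, assuming two made-up branches that split off from a shared "base" commit, the same range works with "git log", and the --left-right option marks which side each commit came from:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo base > foo
git add foo
git commit -q -m "base"
git branch branch1
git branch branch2

# Two commits that exist only on branch1.
git checkout -q branch1
echo a > a
git add a
git commit -q -m "a1"
echo more >> a
git commit -q -am "a2"

# One commit that exists only on branch2.
git checkout -q branch2
echo b > b
git add b
git commit -q -m "b1"

# "<" marks commits only on branch1, ">" commits only on branch2.
git log --oneline --left-right branch1...branch2
uncommon=$(git rev-list --count branch1...branch2)
echo "uncommon commits: $uncommon"
```

The shared "base" commit doesn't appear in the output, since it's reachable from both branches.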

Note: Gitk just runs off of the output from "git log", and it uses the --graph option when it does so. This means the commit ordering isn't necessarily chronological, like I explained above. Use "gitk --date-order" to get them back in order by date.

Hopping Around History

In previous posts, I introduced you to "git checkout" as a way to drop your changes to a file by getting the latest version of the file from the repository:

git checkout <path to file>

and as a way to move to a different branch:

git checkout <branch name>

In fact, "git checkout" can move you to any commit:

git checkout <commit>

This command pulls the state associated with the specified commit from the repository and makes it your working tree.

Note: Although you can use checkout to undo your changes to a file by getting that file from the repo, checking out a commit or branch is different. It isn't allowed to overwrite anything, and it does not perform a merge, so if something is in the way of what you're trying to check out, like a change in your working tree that would be lost by the checkout, Git will refuse to do it. There are two ways out of this situation: stash your changes and do the checkout, or use the -f option to "git checkout" to force the checkout, overwriting any changes.

The message from checking out an arbitrary commit brings up an interesting point: after you do it, you're no longer on a branch. You're on what Git calls a "detached HEAD". Interestingly, though, you can still do just about anything, including commit things. Since you're not on a branch, the commits naturally don't get applied to any branch, but they do happen. They're more or less in limbo, though, and you'll never see them again once you move back to a branch unless you do something to put them into a branch.

Since you can commit outside of a branch, it's rather important that you always ensure you're on a branch when you're working. Both "git status" and "git branch" show what branch you're on, if any. To move back onto a branch when you're not on one, just check out a branch again.
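Here's one way the "put them into a branch" part could look, as a minimal sketch. It assumes a scratch repository, and the branch name "rescued" is just something I picked. The idea is to give the detached commit a branch name before you move away from it.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is "master"
git config user.name "Example"
git config user.email "example@example.com"

echo one > foo
git add foo
git commit -q -m "first"
echo two >> foo
git commit -q -am "second"

# Check out a commit rather than a branch: HEAD is now detached.
git checkout -q 'HEAD^'
git symbolic-ref -q HEAD || echo "not on any branch"

# Commits still work here; they just don't belong to a branch.
echo experiment > exp
git add exp
git commit -q -m "commit made while detached"
detached=$(git rev-parse HEAD)

# Name the commit before leaving so it stays easy to find.
git branch rescued
git checkout -q master
git log --oneline rescued
```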

Looking Without Leaping

If all you want is to see what a file looked like at some time in the past, there's a much quicker way to do that than checking out the whole commit:

git show <commit>:<path to file>
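For instance, in a scratch repository where foo has been committed twice (the contents here are made up), this sketch prints the older version without touching the working tree:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo "first version" > foo
git add foo
git commit -q -m "first"
echo "second version" > foo
git commit -q -am "second"

# Print foo as it was one commit ago; the working tree copy is untouched.
old=$(git show 'HEAD^:foo')
echo "then: $old"
echo "now:  $(cat foo)"
```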

Cool. Let's end on a short section. Stay tuned for still more Git goodness as I further explain how to interact with other Git repositories and with Subversion--a really cool feature!

Next stop, Subversion interaction.

Wednesday, July 22, 2009

Book Review: Agile Database Techniques

I've decided to read a technical book every two weeks. You out there in tubeland will benefit by getting a book review every (roughly) two weeks. Here's the first, a book that I carried around with me for months meaning to read and finally decided I had to do it because there's a bunch more I want to read, too:

Agile Database Techniques: Effective Strategies for the Agile Software Developer
by Scott W. Ambler

The main theme of this book is the impedance mismatch between the traditional management of relational databases and increasingly agile software development, which somewhat mirrors that which exists between an RDBMS and an object-oriented software design. The basic premise is that DBAs still largely tend to define the entire database schema at the very beginning of a project and make it difficult to change, whereas developers are now largely accepting of the fact that software has to evolve instead of being fully designed up front. Some topics are covered because they relate directly to the main theme, and others are geared toward giving DBAs a basic understanding of and common vocabulary with modern software development and developers. Still others, such as the data normalization chapter, strive to do just the reverse for developers--give them a better understanding of the database side of the house. Overall, the author tries to bring DBAs and developers into a common ground where both are aware of the issues that the other must deal with and at the same time urge database professionals to adapt to Agile development, as the future is clearly one of an evolutionary approach to software.

Many of the chapters in this book are essential knowledge for an enterprise developer; however, it was published in 2003, and many concepts that were novel at that time are now taken as a matter of course, so you may already be familiar with much of it. For instance, if you have a good knowledge of Hibernate, you probably won't get much out of the chapters Object-Relational Impedance Mismatch (Chapter 7) or Mapping Objects to Relational Databases (Chapter 14), although they contain fairly important foundational knowledge. A good deal of the material in the book is like this, and it has quite a broad reach, dealing with topics from test-driven development to data normalization to UML.

The book is liberally sprinkled with real-world examples and practical advice. It comes through clearly that Ambler has been there and done that. There's also one chapter that covers a topic that hasn't gained much traction even today: database refactoring. The idea in this chapter is to attempt to loosely apply the idea of code refactoring--"a disciplined technique for restructuring an existing body of code, altering its internal structure without changing its external behavior"--to the database to allow for more evolutionary database approaches even in the cases where multiple systems interact with one database. Ambler gives good examples that demonstrate why and when database refactorings are appropriate, and there is a catalog of established refactorings in the Appendix.

All things considered, I'd highly recommend this book for reading by junior developers. More senior ones should skim through it to be sure they're at least conversant on all the topics covered, as they are all very much relevant today.

Saturday, July 18, 2009

The Journey to Git, Part VI--Rewriting History

Welcome to my sixth (!!!) post on Git. I totally didn't expect this to go so long. In this post, we'll look at some methods Git gives you to change the history of your files. With centralized version control, there's a strong tendency to consider things irrevocable once committed. I'll show you that Git has no such constraints.

This post assumes you've read the previous ones in the series or are at least familiar enough with Git to use some of its basic commands. If you've been following along with the commands I've given in previous posts, this continues to build on the repository you've created as a result. If not, just create a repo and commit a file named "foo" to it with a line of text in it. Then add another line and commit again. This should give you enough to work with to see how the commands in this post work.


Changing History

Git isn't nearly so picky as SVN about changing or undoing things that you've done in version control. It supplies some very handy options for doing just that quickly and easily.

Note: While the commands are available, consider carefully what you're doing when you use them. Have the commits you're changing already been pushed or pulled out somewhere else? If so, will you or Git be able to resolve the differences if you start messing with the history? The easiest thing is to only change the history if the commit hasn't been propagated to another repository!

Amending Commits

The first command is a relatively simple one. Run:

git commit --amend -m "I decided to change this message"

Now "git log" will show that your most recent commit on the current branch has a different message. That's just the tip of the iceberg, though. You can amend a commit in any way. Modify the file foo again, stage the change, and do another amend commit:

echo "Another change to this file" >> foo
git add foo
git commit --amend -m "Foo needed to change again in this commit"

You can change your most recent commit in any way by simply staging some changes and using this command.
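One thing worth seeing for yourself is that an amend replaces the commit rather than editing it in place. In this sketch (scratch repository, made-up messages), the number of commits stays the same, but the hash of the latest one changes:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo one > foo
git add foo
git commit -q -m "first"
echo two >> foo
git commit -q -am "secnd (oops, a typo)"
before=$(git rev-parse HEAD)

# Stage one more change and fold it, plus a fixed message, into that commit.
echo three >> foo
git add foo
git commit -q --amend -m "second"
after=$(git rev-parse HEAD)

count=$(git rev-list --count HEAD)
echo "commits: $count"
echo "hash before amend: $before"
echo "hash after amend:  $after"
```

That changed hash is the reason for the earlier warning about commits that have already been propagated somewhere else: other repositories still have the old commit.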

Undoing Commits with Reset

A more powerful history-altering command is "git reset". There are several options you can give this that do subtly different things. Let's look at the default form first. Run:

git reset HEAD^

The "HEAD^" means one commit before HEAD. This is a quick way of referring to commits rather than using "git log" every time to look up the hash. The carets stack, so "HEAD^^^^" would be four commits before HEAD, which can also be expressed as "HEAD~4".

Back to the reset: when you reset to some commit, you're making that commit the latest on the current branch. Any commits after it go away, but the changes they made aren't lost. All the changes from those dropped commits end up in your working tree, so "git status" will show you that foo is modified, and "git diff" will show that the change is the string that you most recently added to the file and committed.

Alternate forms of this, which we won't try here, are with a --hard and a --soft option (the default option is --mixed). With --soft, you get the same behavior as the default, but the changes end up in the index instead of your working tree. With --hard, you drop back to some commit, but all of your changes are lost. Use with caution. This is the Git equivalent of "rm -rf *". Note that the default target of reset is HEAD, meaning that "git reset --hard" will just erase all the work you've done since your last commit. This can be useful at times.
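If you'd like to compare the three modes side by side without risking your real repo, here's a sketch in a scratch repository. It uses "git status --porcelain", where the first column shows the state of the index and the second shows the state of the working tree. The file contents are made up.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo one > foo
git add foo
git commit -q -m "first"
echo two >> foo
git commit -q -am "second"
tip=$(git rev-parse HEAD)

# --soft: the dropped commit's change is left staged in the index.
git reset -q --soft 'HEAD^'
soft=$(git status --porcelain)
echo "soft:  [$soft]"
git reset -q --hard "$tip"        # put everything back for the next try

# default (--mixed): the change is left in the working tree, unstaged.
git reset -q 'HEAD^'
mixed=$(git status --porcelain)
echo "mixed: [$mixed]"
git reset -q --hard "$tip"

# --hard: the change is gone from both the index and the working tree.
git reset -q --hard 'HEAD^'
hard=$(git status --porcelain)
echo "hard:  [$hard]"
```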

Totally Thrashing Commits with Rebase

Note: I believe you have to have your core.editor configuration option defined in order to use this since it relies on being able to edit text. See post two from this series for configuration stuff.

If you're only familiar with SVN, this one will blow your mind. Use "git log" to get the hash of your first commit, and run:

git rebase -i <hash>

In your text editor, you'll be presented with a list of commits and some basic instructions. The commits are all of the ones from the one after the commit you specified to the current commit--that's <selected commit> + 1 through HEAD, inclusive. By moving the commit lines around, you'll change the order of the commits in this branch. If you decide a commit is trash, delete the line, and when the rebase completes, the commit will be gone. Git does this by removing all of the commits you selected from the branch and then reapplying them in the order you specify.

In addition, you can specify a couple of other things. First, you can choose to amend any of the commits as they're reapplied by simply changing the "pick" before that commit to "e" or "edit". Remember when I said that "git commit --amend" only applies to the most recent commit? (I did say that, right? Back in post three, I think?) Well, that's technically true, but it doesn't mean you couldn't use rebase to go back ten commits to do an amend.

Finally, there's the squash option. "Squash" means compress two or more commits into one. Try this:

echo 1 >> foo
git commit -am "squashme"
echo 2 >> foo
git commit -am "squashme"
echo 3 >> foo
git commit -am "squashme"
git rebase -i HEAD^^^

Note: You'll notice that I used a new flag on commit that let me skip the "git add". Using -a with commit tells it to automatically add all modifications before committing. It only picks up changes to tracked files. The -a flag won't add any new files.

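Now, let's squash the commits we made. Leave the first commit in the list as "pick", change the other two to "squash" or "s", save, and exit. A squashed commit is folded into the commit on the line above it, so the first line needs to stay as a pick. You'll see the rebase happening, and at the end, you'll be prompted for a new commit message.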
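If you want to try the squash without an editor popping up, the whole thing can be scripted. This is only a sketch under a few assumptions: a scratch repository, a sed that accepts -i.bak, and Git's GIT_SEQUENCE_EDITOR and GIT_EDITOR environment variables standing in for your text editor. The sed expression leaves the first line as "pick" and turns the others into "squash".

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo 0 > foo
git add foo
git commit -q -m "base"
for i in 1 2 3; do
    echo "$i" >> foo
    git commit -q -am "squashme $i"
done

# GIT_SEQUENCE_EDITOR edits the list of commits: keep the first "pick",
# change every later "pick" to "squash".
# GIT_EDITOR=true accepts the combined commit message as offered.
GIT_SEQUENCE_EDITOR="sed -i.bak -e '2,\$s/^pick /squash /'" \
GIT_EDITOR=true \
git rebase -i 'HEAD~3'

count=$(git rev-list --count HEAD)
echo "commits after squash: $count"
cat foo
```

The three "squashme" commits end up as a single commit on top of "base", and foo still contains every line they added.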

This interactive rebase (i=interactive) really points to how Git differs from centralized version control with respect to its attitude toward history. In Git, history is completely mutable until it gets too complicated to control by being replicated to other repositories. As long as it only exists in your own repo, you can change practically anything about it. Git views commits as part of the development process, and therefore recognizes them as things that might need to be cleaned up a bit after the fact. In SVN, on the other hand, the general attitude is that once it's in the repository, it's set in adamantium.

Note: Both "git stash pop" and "git stash apply" are available in current versions of Git. The difference is that pop removes the stash entry after applying it, while apply leaves it in the stash list. Check "git help stash" to see the commands available.

In the next--and hopefully last--post that deals with using Git locally, I'll show you how to navigate around your repository, and find out things you want to know about your files and history.

The Journey to Git, Part V--Merging

This post is one of a series on Git. Previously I posted on branching. When you create a branch, you diverge two lines of development. You need a way to join them back up later on, and that's what this post covers.

Before starting this, make sure you've read my other Git posts leading up to it or are comfortable working with branches in Git. This post also assumes you've been following along with the commands in the other posts. If not, you should create a repository with master and put a file named "foo" with a couple lines of text in branch master. I highly recommend that you follow along with the commands so you can see Git working.


Now let's get right to it.

Merging Branches

Create a branch from master and move onto it with:

git checkout -b mergeme

Create a file named "baz" with some text in it, and commit it on this new branch:

echo "File on branch mergeme" >> baz
git add baz
git commit -m "Added baz on branch mergeme"

Suppose that's all the work there is to do on mergeme, and we want to merge the final result into our master branch. Just switch to master and merge the branch into it:

git checkout master
git merge mergeme

You'll see output from the merge like:

Updating 645663b..bb64c2c
Fast forward
 baz |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)
 create mode 100644 baz

The first line indicates the two commits participating in the merge with the starting location first and the merge target second. The next line says "Fast forward" because Git didn't actually have to merge anything in order to make this happen. Since no commits happened on master between the time you created mergeme and the time you merged it back in, Git was able to simply "fast forward" master, meaning it just moved master ahead to the latest commit on mergeme without creating a new commit. Had you made another commit on master before merging mergeme back, Git would have actually had to merge the commits together to create a new commit, which brings us to our next topic: any time an actual merge is performed, you have a chance of entering the lovely land of merge conflicts.
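A quick way to check that a fast-forward really didn't create anything new is to compare what the two branch names point to afterward. Here's a sketch, assuming a scratch repository set up like the walkthrough:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is "master"
git config user.name "Example"
git config user.email "example@example.com"

echo "some text" > foo
git add foo
git commit -q -m "first commit"

git checkout -q -b mergeme
echo "File on branch mergeme" >> baz
git add baz
git commit -q -m "Added baz on branch mergeme"

# master has no commits of its own since the branch point,
# so this merge is a fast-forward.
git checkout -q master
git merge -q mergeme

# Both names now resolve to the very same commit.
git rev-parse master mergeme
total=$(git rev-list --count master)
echo "commits on master: $total"
```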

Resolving Conflicts

Merge conflicts aren't much different in Git than in any other version control. Let's create one by appending a different line to foo in each of the branches and attempting to merge them:

git checkout mergeme
echo "Conflicting line in mergeme" >> foo
git commit -am "Conflict in mergeme"
git checkout master
echo "Conflicting line in master" >> foo
git commit -am "Conflict in master"
git merge mergeme

If it broke as expected, the merge will have produced output like:

Auto-merging foo
CONFLICT (content): Merge conflict in foo
Automatic merge failed; fix conflicts and then commit the result.

When a merge is successful, the result is automatically committed. However, when there's a conflict, the results of the merge stay in your index and/or working tree. Run "git status" and examine the output. It notifies you of files needing to be merged manually, both at the top of the status output and by marking them as "unmerged". You simply need to resolve the conflicts either manually or with your tool of choice, and make sure the file gets added, then commit the final result yourself. If you've set up your own merge tool--see post two in this series--you can fix the conflict with:

git mergetool

Otherwise, just adjust the content of foo to remove the conflict markers, and add it:

git add foo

Either way, all changes should be staged, and you can now:

git commit -m "I merged this myself"

This merge resulted in a new commit in master because it's a real merge that entailed modifications, whereas the fast-forward we did before essentially just pointed master at a commit that already existed on mergeme.
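Here's the whole conflict round trip as one script, in case you want to replay it. It's a sketch in a scratch repository, and the resolution I chose (keeping both lines) is arbitrary. The last command counts the parents of the resulting commit, which is what makes it a real merge.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is "master"
git config user.name "Example"
git config user.email "example@example.com"

echo "base" > foo
git add foo
git commit -q -m "first commit"

git checkout -q -b mergeme
echo "Conflicting line in mergeme" >> foo
git commit -q -am "Conflict in mergeme"
git checkout -q master
echo "Conflicting line in master" >> foo
git commit -q -am "Conflict in master"

# The merge exits with an error status when it hits a conflict.
git merge mergeme || echo "merge stopped on a conflict, as expected"
unmerged=$(git status --porcelain)
echo "status: $unmerged"

# Resolve by hand (here: keep both lines), stage, and commit.
printf 'base\nConflicting line in master\nConflicting line in mergeme\n' > foo
git add foo
git commit -q -m "I merged this myself"

# A merge commit has two parents.
parents=$(git log -1 --format=%P | wc -w | tr -d ' ')
echo "parents of the new commit: $parents"
```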

That covers the basics of merging. I have a feeling I missed something, so this post may change, but for now, let's move on to the next post, where I'll show you how to rewrite history with Git. It's pretty cool: Part VI--Rewriting History

The Journey to Git, Part IV--Branching

Git post 4: Branching. Before you read this post, you should be familiar enough with Git to be comfortable creating a repository and adding and committing files to it. The previous posts in this series will bring you up to that point if you aren't.

Note: This post assumes that you followed along with the commands given in the previous post and have a version-controlled directory with a couple of commits in it. If you don't have that, then create a new git repository in an empty directory, and create a file named "foo" with one line of text in it and commit it. That will get you to where you need to be to start this walkthrough.


I'm going to assume that you're already familiar with the concept of a branch in a VCS. What you need to know is that Git branches are first-class citizens. They're very lightweight and easy to create and work with. With Git, it's not a question of if you'll branch, but of how many branches you'll have going at once.

Creating Branches

Let's create a branch for some concurrent development:

git branch mybranch

That produces no output if it's successful, but it creates a new branch named "mybranch" starting from your current location: the most recent commit of your current branch. View your current branches with:

git branch

It should show the following, indicating you have two branches and are currently on branch master:

* master
  mybranch

Now switch to the branch you just created:

git checkout mybranch

Note: You can both create and move to a branch at the same time with: "git checkout -b mybranch"

In a previous post, you used "git checkout" to pick out a specific commit and make that commit your working tree. It does essentially the same thing when you give it a branch name as an argument. In another post, I'll cover checkout a bit more thoroughly. For now, let's create a new file on this branch:

echo "Branch file" >> bar
git add bar
git commit -m "Added bar on mybranch"

Now "git status" should show that you're on branch mybranch and have nothing to commit, while "git log" will show all three of the commits you've made so far. Move back to master with:

git checkout master

Now bar is gone because it only exists on mybranch, and you'll see that "git log" doesn't show the commit that added it. That's because it shows the commit history of the current branch.

Renaming Branches

This will be a short section:

git branch -m mybranch myotherbranch

That's it. The branch is renamed.

Deleting Branches

Deleting branches is almost as simple. Try this:

git branch -d myotherbranch

It should give you a message like:

error: The branch 'myotherbranch' is not an ancestor of your current HEAD.
If you are sure you want to delete it, run 'git branch -D myotherbranch'.

The keyword "HEAD" refers to the most recent commit of the branch you've checked out. What it's telling you is that you have commits in myotherbranch that aren't in your current branch. If that weren't the case, the delete would have succeeded. Git is very flexible, but it tries to protect you from losing things accidentally. Do what the error said and use the flag that forces a delete:

git branch -D myotherbranch

You'll see a confirmation message. That branch--along with the changes on it--is gone. Make a habit of using the -d form, and you're less likely to accidentally lose things by deleting branches that have changes that don't exist elsewhere.
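The difference between -d and -D is easy to script. This sketch assumes a scratch repository and records whether the safe delete was refused before forcing it:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is "master"
git config user.name "Example"
git config user.email "example@example.com"

echo "line" > foo
git add foo
git commit -q -m "first commit"

# A branch with one commit that master doesn't have.
git branch mybranch
git checkout -q mybranch
echo "Branch file" >> bar
git add bar
git commit -q -m "Added bar on mybranch"
git checkout -q master
git branch -m mybranch myotherbranch

# The safe delete refuses because the branch has unmerged commits.
if git branch -d myotherbranch 2>/dev/null; then
    safe="deleted"
else
    safe="refused"
fi
echo "git branch -d: $safe"

# The forced delete goes through regardless.
git branch -D myotherbranch
remaining=$(git branch --list myotherbranch)
```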

That's really all there is to branching in Git. Like I said, it's very easy to work with branches in Git. Stick around for more branch goodness in the form of merging in my next post: Part V--Merging

The Journey to Git, Part III—The Basics

This post: the most basic commands for interacting with a Git repository. By the end of this post, you should be able to use Git to track basic history for a project. Once again, I strongly encourage you to follow along with the commands in your own environment to help you learn.


In Git, if you don't have a repository, you've got nothing, so let's make a directory and put it under version control using Git. Then we can run some commands against it.

Creating a Repository

Make a directory named firstgitproject and change to it. Run:

git init

You'll see a message like, "Initialized empty Git repository in <path to project>/firstgitproject/.git", and if you look in your project directory, you'll see that there is, indeed, a .git folder. This folder contains the repository you just created along with its configuration.

Status of a New Repo

Let's see what Git has to say about your brand new project with:

git status

You should see three things: that you're "On branch master", that this is the "Initial commit", and that you have "nothing to commit". The "master" branch is always the branch you start off in. In some ways, it's similar to SVN's "trunk", but unlike in SVN, there's nothing special about it. You can rename or delete master with no problem. It's just a branch like any other you might create. The "Initial commit" is a little mysterious in Git. It signifies that you have no history to the project yet. Many Git commands will exit with errors when run against this "Initial commit"--i.e. until you make your first commit to the repo, so let's work on that.

Adding Files/The First Commit

Create File--Working Tree

The directory that you're in, where the .git directory is and where all your files for this project would ultimately go, is what Git refers to as your working tree. It's where you do all your work. Let's add a file to your working tree:

echo "Hello World" > foo

Run "git status" again. Now you have an "untracked" file named foo. Untracked means what it sounds like: Git isn't tracking this file. It exists only in your working tree. Let's change that.

Stage Change--Index

Tell Git to track your newly created file with:

git add foo

Another "git status" now shows foo under "Changes to be committed". What you just did, in Gitspeak, is called "staging a change". You took something that has changed in your working tree (a "change") and told Git to include it in the next commit ("staged it") using the add command. Git has a name for this area where you stage your commits to: the index. I don't know the rationale behind the naming, but when you see the documentation refer to the "index", this is what it's talking about.

Commit Change--Repository

The final step is to commit, which takes all the changes in your index and moves them into the repository. You should be familiar with the concept of a repository. It's where the full history of your project is kept. The general workflow, which we just simulated, is this: you have a working tree and index that are in sync with the repository--you've changed no files and staged no changes. You make changes, and now you have a dirty working tree. You stage some or all of the changes to the index, and now you have a clean working tree and dirty index. Then you commit the staged changes to the repository, and you're back to a clean state. Repeat.

So let's finish the cycle:

git commit -m "My first commit"

Note: If you've set the core.editor configuration option, you can omit the "-m <message>" and your editor will be opened, showing you the complete status message. Just type your commit message, save, and exit to complete the commit.

The response of this command should look something like:

[master (root-commit)]: created b1468b4: "My first commit"
 1 files changed, 1 insertions(+), 0 deletions(-)
 create mode 100644 foo

This indicates several things that I'm going to explain only briefly here. First, you made this commit on branch "master". This is the root-commit, meaning the very first in the repository--the one that went on top of the "Initial commit". A commit identifiable by "b1468b4" was created with the given message. In this commit, one file was changed, a single line was inserted, and no lines were deleted. Finally, this commit included the creation of a new file named "foo" with mode 100644 (that's the Linux file mode that indicates file permissions for you Windows guys).

At this point, some people may be surprised at the fact that we just made a commit without having any kind of Git server set up. Recall that Git is a Distributed VCS. You have an entire copy of the repository, and it lives in that .git directory, remember? So all that's involved in a commit is some local file operations. There's no need for any kind of client/server setup. Now back to our regularly scheduled programming.

Modifying Files Under Version Control

Make Change--Working Tree

The file "foo" is now under version control. Let's make a change to it:

echo "Line 2" >> foo

Now a "git status" will indicate that foo is "Changed but not updated". This message is a little ambiguous. What it means is that the file is under version control, but the copy of the file in your working tree differs from that in the repository, and you haven't staged this change yet. It's time to look at a massively useful command, which you're surely familiar with from other VCSs:

git diff

This is probably the most common use of diff, though there are a host of other options available for it. I'll cover some of those later on.

You've now seen all three parts of the "git status" output. To recap, the output contains 1) the untracked files which aren't under version control, 2) files under version control that you've made changes to, and 3) changes--either new files or modifications to version controlled files--that you've staged and will be included in the next commit.
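To see all three parts at once, here's a sketch that puts one file in each state, again in a scratch repository with made-up file names. It uses the compact "git status --short" form, where "??" means untracked, a letter in the second column means changed but not staged, and a letter in the first column means staged.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo "tracked" > foo
echo "also tracked" > bar
git add foo bar
git commit -q -m "start"

echo "brand new" > untracked.txt   # 1) untracked file
echo "edit" >> foo                 # 2) changed but not staged
echo "edit" >> bar
git add bar                        # 3) staged for the next commit

git status --short
status=$(git status --porcelain)
```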

You'll also see in this latest status output that it prompts you with a couple of ways to deal with this file. You can either "add" it--we'll do this in a moment--or "checkout" it. In SVN, you're used to "checkout" being a very rare command used only for the initial retrieval of a project from a repository. In Git, it's a much more common command which means, roughly, "get X from the repository and put it in my working tree". If you were to "git checkout foo", then foo would be replaced by the version in the repository, removing your most recent changes. In this way, it acts like SVN's revert command. Let's not do that now.

Stage Change--Index

Instead, stage your change with:

git add foo

This is something that regularly trips up SVN users trying out Git. In SVN, you only add newly created files. In Git, you "add" every change. It's actually a more consistent approach and very logical when you get used to the idea. Remember that "add" always stages a change--moves it into the index--and it's the index that gets committed. Another "git status" at this point will show you a similar output to before, when you had staged the newly created file, showing that a new file and a modified file are both just considered "changes" that have to be incorporated.
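To see a change move from the working tree into the index, you can compare "git diff" with "git diff --cached" before and after the add. This is only a sketch in a scratch repository. The directory and the placeholder identity are my assumptions.

```shell
# Throwaway repository (assumption), with one committed file.
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity
echo "foo" > foo
git add foo
git commit -q -m "first commit"

echo "Line 2" >> foo

# Before staging: plain "git diff" (working tree vs. index) sees the change,
# while "git diff --cached" (index vs. last commit) sees nothing.
unstaged_before=$(git diff --name-only)
staged_before=$(git diff --cached --name-only)

git add foo

# After staging: the two have swapped.
unstaged_after=$(git diff --name-only)
staged_after=$(git diff --cached --name-only)

echo "before add: unstaged='$unstaged_before' staged='$staged_before'"
echo "after add:  unstaged='$unstaged_after' staged='$staged_after'"
```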

Commit Change--Repository

Go ahead and commit now with:

git commit -m "My second commit"

Take note of the difference between the output from this commit and the previous one. It's somewhat briefer since this isn't the root-commit and there wasn't a new file created.

Seeing Your History

Now look at your commit history with:

git log

This brings up a really handy feature of Git. If the output of a command like "git log" or "git diff" exceeds what will fit on one screen, Git automatically pipes it through a pager (less, by default), making it easily browsable. Once your history grows beyond a few commits, you'll see this behavior regularly with "git log".

In the commit log, you'll see for each commit 1) the unique identifier of the commit--a 40-hex-digit SHA-1 hash, 2) the author of the commit, 3) the timestamp of the commit, and 4) the commit message entered for the commit.
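If you ever want those four pieces individually, say for a script, "git log" takes a "--pretty=format:" option with a placeholder for each. Here's a rough sketch in a scratch repository. The directory and the identity are placeholders I made up.

```shell
# Throwaway repository (assumption) with a single commit.
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity
echo "foo" > foo
git add foo
git commit -q -m "first commit"

# Pull each field out of the most recent commit.
hash=$(git log -1 --pretty=format:%H)              # full commit hash
author=$(git log -1 --pretty=format:'%an <%ae>')   # author name and email
date=$(git log -1 --pretty=format:%ad)             # author date
message=$(git log -1 --pretty=format:%s)           # first line of the message

echo "commit:  $hash"
echo "author:  $author"
echo "date:    $date"
echo "message: $message"
```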

Removing Files from Version Control

Now for one more, trivial command:

git rm foo
git commit -m "I removed foo"

That's pretty self-explanatory, I think. It's what you do to delete files. Go ahead and add foo back, so that we can use it some more:

echo "new foo" >> foo
git add foo
git commit -m "Added foo back"
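One detail worth seeing for yourself: "git rm" deletes the file from your working tree and stages the deletion in one step, so there's no separate "add" before the commit. A minimal sketch, again in a scratch repository with a made-up identity:

```shell
# Throwaway repository (assumption) with one tracked file.
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity
echo "foo" > foo
git add foo
git commit -q -m "first commit"

git rm -q foo

# The file is gone from the working tree...
if [ -e foo ]; then in_tree=yes; else in_tree=no; fi
# ...and the deletion is already staged, waiting for the commit.
staged=$(git diff --cached --name-only)

git commit -q -m "I removed foo"

# Nothing is tracked any more.
tracked=$(git ls-files)
echo "in working tree: $in_tree, staged before commit: $staged"
```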

View Previous Versions

Establishing a history of things isn't very useful if you can't look back through the history. Here's the basic form of two commands to let you get a feel for a simple history. Later on, I'll do a whole post on navigating around a repository and seeing what's in it.

Show a Previous Version

Use "git log" to get the hash of a previous commit, and run:

git show <hash>:foo

That will show you the content of the file foo at that commit. Now use "git log" to pick two different commits, and run:

git diff <older hash> <newer hash>

You'll see the differences between the two versions in patch format. Added lines have a '+' in front, and deleted lines have a '-'.
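Here's the same pair of commands run end to end, as a sketch you can paste into a terminal. Instead of copying hashes out of "git log", it uses "git rev-parse" to look them up. The scratch directory and identity are my own placeholders.

```shell
# Throwaway repository (assumption) with two commits to foo.
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity
echo "foo" > foo
git add foo
git commit -q -m "first commit"
echo "Line 2" >> foo
git add foo
git commit -q -m "My second commit"

# Look up the two hashes rather than copying them by hand.
older=$(git rev-parse HEAD~1)
newer=$(git rev-parse HEAD)

# Content of foo as of the older commit.
old_content=$(git show "$older:foo")

# Patch-format differences between the two commits.
patch=$(git diff "$older" "$newer")

echo "$old_content"
echo "$patch"
```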

I'm going to break this post here, though there's lots more to cover. By now, however, you know all you need in order to keep a simple, forward-moving history of any personal project you happen to be running. It's also handy for versioning configuration files if you run, for instance, a Squid proxy server or other local service with complex, text-based configuration. The next couple of posts will primarily cover branching and merging, where Git really shines in comparison to centralized version control.

First, branching: Part IV--Branching


Monday, July 13, 2009

The Journey to Git, Part II—Git Started

This second post in my Git series will just cover setting up Git so that I can keep these things fairly short and well-organized. From here on, everything is very hands-on. If you're serious about learning Git, I highly recommend you follow along with all of the steps to reinforce what you're reading. This is all going to be a command line walkthrough, though Git ships with some nice GUI tools as well. Once you're familiar with the command line and the concepts, the GUI tools should be easy to figure out.



Note: This is something of a rough draft right now without many links for reference. After posting it initially and seeing how long it was, I decided to come back and break it up in order to not scare off readers with short attention spans. You know who you are.

Note: For extensive information about any git command, you can use "git help <command>" to see the man page for it. Alternately, just google "git <command>", and the first link will probably always be the kernel.org HTML version of the man page.

On with the show...


First, obviously, install Git. On Linux, it's a simple package installation. On Windows, you can use either msysgit or install the Git package in Cygwin. I've used msysgit only in passing. Most of my experience is evenly split between Cygwin and Ubuntu. The only other thing I'll say here is you do not want msysgit and Cygwin installed at the same time (which is why I didn't use msysgit much). The two are not compatible, and you'll likely see some odd behavior in Cygwin if you do it (like scp thinking your home directory is "c:\program files\msysgit"). If you need further details on installing Git, email me, and I'll try to help.

If Git is properly installed, then you should be able to run:

git --version


There's some configuration we can get out of the way up front. This is all optional, in fact, but at least the first one is certainly a good idea. Your configuration tool in Git is "git config". By default, this command updates a repository's configuration, but with the "--global" flag, it sets options for your user account across all of your repositories, so we can set up some preferences before we even have a repo. For more info about the command and a full list of configuration options, see the man page for "git config".
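To get a feel for how the two levels interact, here's a small sketch. It points HOME at a scratch directory so that "--global" writes to a throwaway .gitconfig instead of your real one. That sandboxing, and the example names, are my assumptions for the demo.

```shell
# Sandbox: a fake home directory so --global can't touch your real config.
scratch=$(mktemp -d)
export HOME="$scratch/home"
unset XDG_CONFIG_HOME
mkdir -p "$HOME" "$scratch/repo"
cd "$scratch/repo"
git init -q

# Per-user setting, stored in $HOME/.gitconfig.
git config --global user.name "Global Name"

# Per-repository setting, stored in .git/config of this repo only.
git config user.name "Repo Name"

global_value=$(git config --global user.name)
effective_value=$(git config user.name)   # what Git actually uses in this repo

echo "global:    $global_value"
echo "effective: $effective_value"
```

The repository-level value wins inside that repo, and every other repo falls back to the global one.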

Identify Yourself

Let Git know who you are with:

git config --global user.name "your name"
git config --global user.email "your email"

This name and email are included in the information about every commit that you make.

Pick an Editor

There are a few things that Git needs to open a text editor for, like typing commit messages if you don't want to type them in-line. Give it the path to your preferred editor with:

git config --global core.editor <path>

Pick a Merge Tool

For some reason, I've never gotten into graphical merge tools, preferring instead to just view the raw conflict markers, but I know some people wouldn't even glance at a VCS that didn't offer the option of merge GUIs. Git knows about several Linux merge tools, and if you want to use one of those, you can just use:

git config --global merge.tool <tool>

Built-in merge tools are listed in the man page. If you want to use something other than the built-ins, which you probably will if you're on Windows, it's a bit more work. Assuming you're using WinMerge:

git config --global merge.tool winmerge
git config --global mergetool.winmerge.cmd "WinMergeU \$MERGED"

If WinMerge isn't on your path, you should be able to specify the path to it with:

git config --global mergetool.winmerge.path <path>

However, I couldn't get that to work. It's fine if you just leave off the path option and add WinMerge's directory to your system path instead.
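In case the backslash in "\$MERGED" looks odd: it keeps your shell from expanding the variable when you run "git config", so Git stores the literal text and fills it in later when it launches the tool. A quick sketch to confirm what gets stored, using a scratch HOME (my assumption) so your real configuration is left alone:

```shell
# Sandbox: a fake home directory so --global writes to a throwaway file.
scratch=$(mktemp -d)
export HOME="$scratch"
unset XDG_CONFIG_HOME

git config --global merge.tool winmerge
git config --global mergetool.winmerge.cmd "WinMergeU \$MERGED"

# Read the value back exactly as Git stored it.
stored_cmd=$(git config --global mergetool.winmerge.cmd)
echo "$stored_cmd"
```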

For Windows

If you're using one of the Windows Git flavors, you'll very likely want to set these options to handle line terminators more gracefully:

git config --global core.autocrlf true
git config --global core.safecrlf true

And if you're using msysgit, you may need:

git config --global core.fileMode false

I had trouble with msysgit that this option fixed.

There are tons more configuration options documented in the man page, but these should get you started. In the next post, we'll finally get to basic, day-to-day commands that will let you start using Git productively: Part III--The Basics


The Journey to Git, Part I—Distributed vs Central VCS

This one has been a long time coming. I've been using Git as a client to Subversion at work now for a number of months. Over the weekend, I attended a session on Git at the NFJS conference, and it served to solidify some of the concepts in my mind. I think I'm ready to do one or more posts on the subject now.
This first post is going to be more an examination of distributed vs. centralized VCS than anything specific to Git itself. The remainder of the posts in this series of currently unknown length will be a quick start tutorial to get you up and running, teaching you how to use Git at a very basic level with just a single, local Git repository. When I was trying to get going with Git, I couldn't find a good reference online to get me up and running in the way I needed, so I'm going to try to make this as complete as possible. I assume zero Git knowledge here, but I do assume you're familiar with version control in general.



Why Git?

First, I'm going to examine why we even need to care about a new VCS. What does Git have that SVN doesn't? That really comes down to the fact that they're two different beasts. It's not so much Git vs. SVN, but distributed VCS (DVCS) vs centralized VCS.

Centralized VCS

With centralized version control, a development team is locked in in terms of workflow. You have a local copy of the project (or more than one). You can make changes to your heart's content, but everything is strictly local until and unless you commit back to the repository. Then whatever you did is visible to everyone. There's no in-between here, no compromise. I actually find it quite remarkable that developers the world over have been content with this model for so long. There are a few very immediate problems that come up.

Problems with Central Control

#1: Collaboration

Scenario: You're working on adding a new feature to an application and you want another developer to give you a hand with part of it or look over some code you think is questionable, or maybe he's just working on something else that interacts with your piece. Your code needs to be available for the other guy to work with. With a centralized VCS, your only options are to 1) commit your code, making it visible to everyone else, or 2) share it some other way outside of the VCS. Option 2) is just messy, if workable. Option 1) would probably be preferred for the most part. The problem here is that your code could be in very early development, possibly even broken. We all know there's a horrible taboo on committing broken code, and rightly so with central control. So you're stuck with not being able to share or with having to stop work to make things non-broken to the point that you can safely commit before you can collaborate on a piece of code.

#2: Releases

This problem is significantly more insidious. Any time you commit something to a centralized VCS, you're essentially affirming that this feature will absolutely be included in the next release. This assumes, of course, that you're not doing work on your own branch specific either to your feature or to yourself--a whole other can of worms with centralized version control. As your release process becomes more streamlined and automated (that's happening, right?) and your release cycles get shorter because you're aiming to be a more agile--that's agile rather than necessarily Agile--team (that's happening, too, right?), it becomes more important not to schedule features for a release, but to schedule a release for a date and put in whatever features are finished on that date. This means it's imperative that unfinished features never be committed to your mainline development branch, like trunk, since you can't know exactly when a feature will be ready for release. Instead, features should be developed in isolation and only moved into the "trunk" branch when completely finished. This leads back to the branch-per-feature idea, which central VCS in general and SVN in particular just aren't designed for.

#3: History

This one's simple and may be less compelling to some. It's a big hangup for me. A VCS is all about keeping track of history, right? Well, who ever said history was perfect? Only a centralized VCS tries to enforce that by demanding that every commit be 100% bug-free since everyone is going to see it. This leads to larger, less frequent commits, potentially as much as days apart. Days! Why have we come to accept this as the norm? I say version control should be no more than an hour behind you at worst. Small, incremental changes are better. VCS should be a coarse-grained undo system.

Distributed VCS

How does a DVCS answer these concerns? Laying out the problems took quite a bit of space, but interestingly, they're all solved by this fundamental difference: in a DVCS, you not only have your own working copy, you in fact have your own copy of the entire repository, or at least the pieces you care about. You can change anything about it in any way, and nothing leaks out to anyone else unless 1) you push it out or 2) someone else pulls it from you intentionally. This solves all three of my problems since nobody, including a release, will see what you've done until you're ready for them to, but at the same time, your changes are available to anyone who's interested.
Now that you know why a DVCS is a good idea, let's get back to Git itself. There seem to be three mainstream DVCSs out there today: Git, Mercurial, and Bazaar. Of the three, Git seems to be most widely used and gaining traction. That's not necessarily a reason to use it, but Git is what I've used, so it's what I'm writing about.
See the next post to begin a walkthrough tutorial of Git: Part II--Git Started.

Sunday, July 12, 2009

NFJS the Beginning

I just got home from the Austin, TX No Fluff Just Stuff conference. Wow! It's the first one I've been to, and it was great. For my reference and yours, I'm going to put some notes here of things that bear remembering. I expect I'll make several posts over the next few days (or more) on topics covered at the conference. Today:

Open Source Debugging Tools

One of the really good sessions I attended was on a variety of open source tools useful for debugging Java apps. It's getting to the point, or maybe it's already there, where there's almost no benefit to shelling out the cash for a commercial debugging/profiling application like YourKit. Think “tools” in a loose sense here, because some of these things are just handy CLI commands that make your debugging efforts more efficient. If you're too lazy or don't have time to read the whole thing, the "must-see"s for me in this list were OQL (in jhat or Eclipse Memory Analyzer--I don't think VisualVM supports it yet), omniscient debuggers, and BTrace.
I believe all the tools I mention are available for Windows, Linux, or Mac unless otherwise stated. Certainly the Java-based ones are. It started off with tools that aren't actually Java-oriented, but instead are for network debugging:
  • cURL – absolutely raw HTTP interaction. This guy is great for debugging any of the many protocols that use HTTP: normal web requests, AJAX, SOAP, web services, etc. You can specify any request method, parameters, body, headers, etc. Also has modes for HTTPS, FTP(S), SCP, and more.
  • Tcpdump – watch raw network traffic at the packet level. This tool shows some or all of the content of packets passing through the network interface of your computer. It's highly configurable to capture only exactly what you want.
  • Wireshark – GUI for viewing tcpdump output, formerly known as Ethereal. Wireshark can also do the captures itself, using the same libpcap capture library that tcpdump is built on. Often, though, you're working in a headless environment, in which case you'd need to use tcpdump and ship the dump file back to where you can analyze it with the excellent Wireshark interface.
Now to the Java tools. A “**” indicates that a tool is distributed with the JDK since version 1.5. I got tired of typing it over and over:
  • JMeter – Apache JMeter has come a long way. It's primarily a load testing tool that can run against a number of different types of servers that one typically encounters in Java enterprise development. The interface isn't entirely intuitive, but don't give up too quickly. It is extremely flexible and powerful.
  • Jps** – like ps in Linux, but shows Java-specific details about all running JVM processes.
  • Jstat** – shows JVM statistics for a particular JVM. This shows mostly memory- and garbage-collection-related stats.
  • Jconsole**/VisualVM – graphical tool that shows tons of statistics about a JVM as well as JMX Mbeans. They can also initiate and analyze heap dumps. Jconsole is being deprecated in favor of VisualVM. While VisualVM is actually included with the JDK, it's recommended to download it separately since development is ongoing, and the version from the website will likely always be way ahead of the version in your JDK.
  • Jstatd** – a daemon that allows you to connect jps, jstat, and jconsole/VisualVM to remote machines. I.e. run jstatd on the remote machine, and then you can connect one of the other tools to it.
  • Jstack** – shows stacktraces of all threads in a running JVM.
  • Jmap** – displays/dumps content of the Java heap. A heap dump can be very large and expensive, but invaluable in diagnosing performance problems and OOMEs. Dumps are created in a standard format that can be read by a number of tools.
  • Jhat** – analyze and display—actually run an HTTP server that serves the results of—a heap dump. Jhat is quick and effective (given enough of its own heap space to run in!), but since it exposes everything as HTML, it's somewhat limited in comparison to Jconsole/VisualVM or another tool that can parse heap dumps.
  • OQL – not a tool, per se, but I only learned about it this weekend, and it's awesome! Object Query Language is a language for letting you query a heap dump to find out virtually anything you could possibly want to know in one, neat result set! Simple example: “select z from java.lang.String z where z.count > 50” – applied to a heap dump will display all String instances with more than 50 characters. I believe a number of tools, including jhat, support OQL, but I haven't researched this enough yet.
  • Jinfo – display or change JVM options while a JVM is running. For instance, turn on HeapDumpOnOutOfMemoryError on every important JVM you're currently running because it could help you immensely in diagnosing crashes.
  • Javap – class file disassembly. Given any Java class file, show a human-readable representation of the bytecode. If the class was compiled without debug information, it will be somewhat less readable.
  • Eclipse Memory Analyzer – Eclipse plugin that can load (but not initiate) heap dumps and run OQL queries against them and has some quite friendly and useful visual representations of the dump.
  • BTrace – a utility that lets you easily inject your own code into a running JVM by dynamically instrumenting bytecode. The code is quite simple. You just have to write a static method and put an annotation on it telling it when to run. Note: the code you can put in a BTrace class is extremely limited. It seems to be mainly intended for creating more verbosity in an application when things aren't working quite right and your logging isn't telling you what you need to know.
  • “Omniscient” debuggers – We didn't have time to get into these, but they look crazy powerful. The idea with these is that the debugger knows everything and will let you do anything, including stepping backward through a program, to figure out where a bug came from. They do all this by recording every single thing that happens in the JVM. Thus while they're crazy powerful, they're also crazy slow, but still... A couple that were mentioned: ODB and TOD.
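To give a feel for how a few of the JDK command line tools above fit together, here's a rough sketch of a shell helper. The name inspect_jvm and the output file names are hypothetical, made up for this example. The flags are ones I'm confident exist on the JDK tools, but check your JDK's man pages, since they have shifted between versions.

```shell
# Hypothetical helper: run a few of the JDK tools above against one JVM.
# Usage: inspect_jvm <pid>   (find the pid with "jps -l")
inspect_jvm() {
    pid=$1
    if [ -z "$pid" ]; then
        echo "usage: inspect_jvm <pid>" >&2
        return 2
    fi
    if ! command -v jstack >/dev/null 2>&1; then
        echo "JDK tools not found on PATH" >&2
        return 1
    fi
    jstat -gcutil "$pid"                                # GC and memory summary
    jstack "$pid" > "threads-$pid.txt"                  # stack traces of all threads
    jmap -dump:format=b,file="heap-$pid.hprof" "$pid"   # binary heap dump
    jinfo -flag +HeapDumpOnOutOfMemoryError "$pid"      # dump the heap on future OOMEs
}

# List running JVMs if a JDK is installed; otherwise just say so.
if command -v jps >/dev/null 2>&1; then
    jps -l
else
    echo "jps not found; install a JDK to try this"
fi
```

The heap dump it writes can then be loaded into jhat, VisualVM, or Eclipse Memory Analyzer. Keep in mind that dumping the heap of a large, busy JVM can be slow and will pause the application while it runs.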
Finally, a few more non-Java tools that deal with general OS stuff:
  • fs_usage (Mac), inotifywait, inotifywatch, lsof (Linux) – tools for watching changes to files and/or seeing what processes are changing files.
  • Process Monitor (Windows) – lets you see what processes are changing files and much, much more for Windows. Not open source, but free. This tool is like Task Manager on steroids.
  • FileMon (Windows) – close cousin of Process Monitor that shows lots more file-related information in Windows. Also not open source, but free.