
Monday, July 27, 2009

The Journey to Git, Part VII--Other Useful Stuff

My previous Git posts were mostly a walkthrough of the basic workflow to get you up and running with Git fast. This post is less that and more a quick survey of other commands that are regularly used and/or useful. Previous posts aren't a prerequisite for this, but you need to at least have a repository with a few commits and branches in it to be able to run the commands and see what they do.


See What Changed

One of the most frequently used commands is one found in nearly every version control system:

git diff

This command, by default, simply shows you what is different in your working tree from your index. In other words, it shows you what you've changed since the last commit but haven't staged yet. To see changes you've staged for commit, use:

git diff --staged
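If you'd like to see the two forms side by side, here's a rough sketch you can paste into a shell. The throwaway directory, the placeholder user name and email, and the file contents are all my own assumptions for the demo.

```shell
set -e
repo=$(mktemp -d)                          # throwaway sandbox directory (assumption)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity (assumption)
git config user.email "user@example.com"

echo "line 1" > foo
git add foo
git commit -q -m "First commit"

echo "staged line" >> foo
git add foo                    # this change is now in the index
echo "unstaged line" >> foo    # this change exists only in the working tree

unstaged=$(git diff)           # working tree compared to the index
staged=$(git diff --staged)    # index compared to the last commit

printf '%s\n\n%s\n' "$unstaged" "$staged"
```

The first diff should add only "unstaged line", and the second should add only "staged line".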

Of course, you can also use it to view the changes between any two arbitrary commits and/or branches:

git diff <commit|branch> <other commit|branch>

Note: Unless you want to see history in reverse, you always put the older commit first and the newer commit second.

And finally, you can see just the changes to a particular file or set of files by listing their names after the command and any options:

git diff file1 file2 ...

When you're using commands like this that refer to commits, it quickly gets old to look up their hashes, even when you can just copy/paste them. Fortunately, Git provides a concise vocabulary for specifying commits without using hashes. First, "HEAD" always refers to the "tip", or latest commit, of the current branch. You can also typically use the branch name to refer to the same commit, so we have at least one commit in each branch we can always refer to without knowing its hash. After that, when you know the name of any commit, you can use a caret to say "previous", so "HEAD^" means the commit before the latest commit on the current branch. Likewise, "master^" would refer to the commit before the latest commit on branch master. Carets stack, and each additional one signifies one more commit backward: "HEAD^^^^" is four commits before the latest commit on the current branch. This can also be expressed with "HEAD~4". Just use the tilde and a number to go back a specified number of commits. This is just the proverbial tip of the iceberg on specifying commits, but it's likely all you'll need for a great majority of what you'll do.
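To convince yourself that these names really do resolve to the same commits, you can ask "git rev-parse" to translate them into hashes. This is just a sketch in a scratch repository. The five commits, the file name, and the identity settings are made up for the demo.

```shell
set -e
repo=$(mktemp -d)                          # throwaway sandbox directory (assumption)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity (assumption)
git config user.email "user@example.com"

for i in 1 2 3 4 5; do
  echo "change $i" >> foo
  git add foo
  git commit -q -m "Commit $i"
done

branch=$(git rev-parse --abbrev-ref HEAD)  # name of the current branch
tip=$(git rev-parse HEAD)                  # hash of the latest commit
by_branch=$(git rev-parse "$branch")       # same commit, named by its branch
carets=$(git rev-parse 'HEAD^^^^')         # four commits back, using carets
tilde=$(git rev-parse HEAD~4)              # four commits back, using a tilde

echo "$tip $by_branch"
echo "$carets $tilde"
git log --oneline -1 HEAD~4                # the first of the five commits
```

Each echo should print the same hash twice.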

Given this new way of specifying commits, a command I use quite regularly is:

git diff HEAD^ HEAD

That is: show me what I did in the last commit. One final note: there are many diff viewing GUIs out there, but I'm not going to go into that much right now. If you've made it this far, you can probably manage setting one of them up on your own. I'll just point you at:

git help config

Search the output for "diff.external", and go from there. If you need more help, drop me a comment, and I'll see what I can do.

See History

The Command Line Way

Another often-used command that's common to VCSs is the log command:

git log

We've used this command in previous posts, but I'm going to add a few variations and a bit of detail to your toolbox here.

You might be accustomed to a log command that shows all the commits made on the current branch. Git does things slightly differently. The "git log" command shows all commits contained within the current branch. It's a subtle difference. When you merge a branch into another, not only does the merge commit show up in the destination branch, the commits from the branch that was merged appear as well. That's because all those commits are part of the state of that branch now. This can be slightly confusing to look at sometimes, but there's a handy option that helps you sort out where each commit came from when you need to:

git log --graph

The --graph option represents branches as lines to the left of the commits being shown. Each commit will have an asterisk next to it in one of the lines indicating which branch the commit was actually made on.

Note: The --graph option also changes the ordering scheme of the commits, potentially causing them to not appear in chronological order. I suppose this is supposed to make it easier to read the graph, but I find it distracting. Use the --date-order option to put them back in chronological order.
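Here's a small sketch that builds a history with one merge in it so there's something for --graph to draw. The branch name "topic", the file names, and the identity settings are assumptions for the demo. I force the first branch to be called master so the names match these posts.

```shell
set -e
repo=$(mktemp -d)                          # throwaway sandbox directory (assumption)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master    # make sure the first branch is named master
git config user.name "Example User"        # placeholder identity (assumption)
git config user.email "user@example.com"

echo base > foo
git add foo
git commit -q -m "Base commit on master"

git checkout -q -b topic
echo work > bar
git add bar
git commit -q -m "Work on topic"

git checkout -q master
echo more >> foo
git commit -q -am "More work on master"
git merge -q -m "Merge topic into master" topic

# One asterisk per commit, drawn in the line of the branch it was made on
git log --graph --oneline --date-order
total=$(git rev-list --count HEAD)         # commits reachable from master
```

Notice that the log on master includes the commit made on topic, which is the "contained within" behavior described above.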

Another useful option lets you search commit messages and show only commits that match the search pattern:

git log --grep="some text"

Finally, sometimes it's handy to just see commits from certain times:

git log --since=yesterday
git log --until="15 Jul"

Note that none of these options are mutually exclusive. This is perfectly valid:

git log --graph --since="last week" --until="two days ago"
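The date filters are hard to try out on a repository you created five minutes ago, so this sketch backdates a few commits using the GIT_AUTHOR_DATE and GIT_COMMITTER_DATE environment variables. The commit_on helper, the dates, and the messages are all hypothetical.

```shell
set -e
repo=$(mktemp -d)                          # throwaway sandbox directory (assumption)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity (assumption)
git config user.email "user@example.com"

# Hypothetical helper: commit_on <date> <message>
commit_on() {
  echo "$2" >> notes
  git add notes
  GIT_AUTHOR_DATE="$1" GIT_COMMITTER_DATE="$1" git commit -q -m "$2"
}

commit_on "2009-07-01 12:00:00 +0000" "Add parser"
commit_on "2009-07-15 12:00:00 +0000" "Fix bug in parser"
commit_on "2009-07-27 12:00:00 +0000" "Write docs"

# Only the commit dated between the two limits
in_window=$(git log --since="2009-07-10" --until="2009-07-20" --format=%s)

# Only commits whose message matches the pattern
matching=$(git log --grep="parser" --format=%s)

echo "$in_window"
echo "$matching"
```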

Note: Git does the date parsing itself with a forgiving built-in parser, informally known as "approxidate". I haven't found one obvious place in the docs that spells out everything it accepts, but it's very versatile.

The Fancy Way

Well, "git log" is great and all that, but what about when you really need to get down in the dirt and pick through the complete history of files? That's what gitk is for. It's similar to "git log" but packs much more detail into a screen. Gitk is one of the GUIs that comes with Git. In Cygwin and msysgit, it's installed along with the Git package, but on Linux, it's a separate package named--wait for it--"gitk". In any of them, the name of the command is also "gitk".

Run "gitk", and take a look around the interface. The top part of the screen is your list of commits, just like in "git log". Click a commit to select it. The bottom part is a diff showing what was changed in the currently-selected commit. In between the two is a control panel that, among other things, shows the SHA-1 of the selected commit, lets you search the selected commit for text, and lets you run rather powerful searches of all the commits appearing in the top pane.

In addition to all of that, "gitk" accepts many options that "git log" does, including some that lend themselves extremely well to the graphical representation. For one, you can see all commits from all branches with:

gitk --all

Another great feature is the ability to view any uncommon history between two branches you're thinking about merging:

gitk branch1...branch2

Run that way, it will show all commits from the latest commit on each of the branches back to the nearest common commit between the two: i.e. all the commits you're about to merge together.
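Gitk is a GUI, so I can't script it for you, but "git log" accepts the same three-dot range, and that makes for an easy experiment. In this sketch the branch names, file names, and identity settings are assumptions. The --left-right option marks which side each commit belongs to.

```shell
set -e
repo=$(mktemp -d)                          # throwaway sandbox directory (assumption)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master    # make sure the first branch is named master
git config user.name "Example User"        # placeholder identity (assumption)
git config user.email "user@example.com"

echo base > foo
git add foo
git commit -q -m "Common base"
git branch branch1
git branch branch2

git checkout -q branch1
echo one > f1
git add f1
git commit -q -m "Only on branch1"

git checkout -q branch2
echo two > f2
git add f2
git commit -q -m "Only on branch2 (a)"
echo three >> f2
git commit -q -am "Only on branch2 (b)"

# "<" marks commits only on branch1, ">" marks commits only on branch2
git log --oneline --left-right branch1...branch2
unshared=$(git rev-list --count branch1...branch2)
base=$(git merge-base branch1 branch2)     # the nearest common commit
```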

Note: Gitk just runs off of the output from "git log", and it uses the --graph option when it does so. This means the commit ordering isn't necessarily chronological, like I explained above. Use "gitk --date-order" to get them back in order by date.

Hopping Around History

In previous posts, I introduced you to "git checkout" as a way to drop your changes to a file by getting the latest version of the file from the repository:

git checkout <path to file>

and as a way to move to a different branch:

git checkout <branch name>

In fact, "git checkout" can move you to any commit:

git checkout <commit>

This command pulls the state associated with the specified commit from the repository and makes it your working tree.

Note: Although you can use checkout to undo your changes to a file by getting that file from the repo, checking out a commit or branch is different. It isn't allowed to overwrite anything, and it does not perform a merge, so if something is in the way of what you're trying to check out, like a change in your working tree that would be lost by the checkout, Git will refuse to do it. There are two ways out of this situation: stash your changes and do the checkout, or use the -f option to "git checkout" to force the checkout, overwriting any changes.

The message from checking out an arbitrary commit brings up an interesting point: after you do it, you're no longer on a branch. You're on what Git calls a "detached HEAD". Interestingly, though, you can still do just about anything, including commit things. Since you're not on a branch, the commits naturally don't get applied to any branch, but they do happen. They're more or less in limbo, though, and you'll never see them again once you move back to a branch unless you do something to put them into a branch.

Since you can commit outside of a branch, it's rather important that you always ensure you're on a branch when you're working. Both "git status" and "git branch" show what branch you're on, if any. To move back onto a branch when you're not on one, just check out a branch again.
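Here's a sketch of the whole detached HEAD round trip, including one way to keep a commit you made while off a branch: give it a branch name before you leave. The branch name "rescued", the file names, and the identity settings are assumptions for the demo.

```shell
set -e
repo=$(mktemp -d)                          # throwaway sandbox directory (assumption)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master    # make sure the first branch is named master
git config user.name "Example User"        # placeholder identity (assumption)
git config user.email "user@example.com"

echo one > foo
git add foo
git commit -q -m "First commit"
echo two >> foo
git commit -q -am "Second commit"

git checkout -q HEAD^                      # check out a commit rather than a branch
state=$(git rev-parse --abbrev-ref HEAD)   # prints "HEAD" when you're detached

echo experiment > scratch
git add scratch
git commit -q -m "Commit made while detached"
stray=$(git rev-parse HEAD)

git branch rescued                         # point a new branch at the stray commit
git checkout -q master                     # move back onto a branch
git branch --contains "$stray"             # lists rescued, so the commit is still reachable
```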

Looking Without Leaping

If all you want is to see what a file looked like at some time in the past, there's a much quicker way to do that than checking out the whole commit:

git show <commit>:<path to file>
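A quick sketch of that in action, using a scratch repository with two versions of one file. The file contents and identity settings are made up for the demo.

```shell
set -e
repo=$(mktemp -d)                          # throwaway sandbox directory (assumption)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity (assumption)
git config user.email "user@example.com"

echo "version 1" > foo
git add foo
git commit -q -m "First version"
echo "version 2" > foo
git commit -q -am "Second version"

old=$(git show HEAD^:foo)    # the file as it was one commit ago
current=$(cat foo)           # the working tree is left untouched

echo "then: $old"
echo "now:  $current"
```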

Cool. Let's end on a short section. Stay tuned for still more Git goodness as I further explain how to interact with other Git repositories and with Subversion--a really cool feature!

Next stop, Subversion interaction.

Wednesday, July 22, 2009

Book Review: Agile Database Techniques

I've decided to read a technical book every two weeks. You out there in tubeland will benefit by getting a book review every (roughly) two weeks. Here's the first, a book that I carried around with me for months meaning to read and finally decided I had to do it because there's a bunch more I want to read, too:

Agile Database Techniques: Effective Strategies for the Agile Software Developer
by Scott W. Ambler

The main theme of this book is the impedance mismatch between the traditional management of relational databases and increasingly agile software development, which somewhat mirrors that which exists between an RDBMS and an object-oriented software design. The basic premise is that DBAs still largely tend to define the entire database schema at the very beginning of a project and make it difficult to change, whereas developers are now largely accepting of the fact that software has to evolve instead of being fully designed up front. Some topics are covered because they relate directly to the main theme, and others are geared toward giving DBAs a basic understanding of and common vocabulary with modern software development and developers. Still others, such as the data normalization chapter, strive to do just the reverse for developers--give them a better understanding of the database side of the house. Overall, the author tries to bring DBAs and developers into a common ground where both are aware of the issues that the other must deal with and at the same time urge database professionals to adapt to Agile development, as the future is clearly one of an evolutionary approach to software.

Many of the chapters in this book are essential knowledge for an enterprise developer; however, it was published in 2003, and many concepts that were novel at that time are now taken as a matter of course, so you may already be familiar with much of it. For instance, if you have a good knowledge of Hibernate, you probably won't get much out of the chapters Object-Relational Impedance Mismatch (Chapter 7) or Mapping Objects to Relational Databases (Chapter 14), although they contain fairly important foundational knowledge. A good deal of the material in the book is like this, and it has quite a broad reach, dealing with topics from test-driven development to data normalization to UML.

The book is liberally sprinkled with real-world examples and practical advice. It comes through clearly that Ambler has been there and done that. There's also one chapter that covers a topic that hasn't gained much traction even today: database refactoring. The idea in this chapter is to attempt to loosely apply the idea of code refactoring--"a disciplined technique for restructuring an existing body of code, altering its internal structure without changing its external behavior"--to the database to allow for more evolutionary database approaches even in the cases where multiple systems interact with one database. Ambler gives good examples that demonstrate why and when database refactorings are appropriate, and there is a catalog of established refactorings in the Appendix.

All things considered, I'd highly recommend this book for reading by junior developers. More senior ones should skim through it to be sure they're at least conversant on all the topics covered, as they are all very much relevant today.

Saturday, July 18, 2009

The Journey to Git, Part VI--Rewriting History

Welcome to my sixth (!!!) post on Git. I totally didn't expect this to go so long. In this post, we'll look at some methods Git gives you to change the history of your files. With centralized version control, there's a strong tendency to consider things irrevocable once committed. I'll show you that Git has no such constraints.

This post assumes you've read the previous ones in the series or are at least familiar enough with Git to use some of its basic commands. If you've been following along with the commands I've given in previous posts, this continues to build on the repository you've created as a result. If not, just create a repo and commit a file named "foo" to it with a line of text in it. Then add another line and commit again. This should give you enough to work with to see how the commands in this post work.


Changing History

Git isn't nearly so picky as SVN about changing or undoing things that you've done in version control. It supplies some very handy options for doing just that quickly and easily.

Note: While the commands are available, consider carefully what you're doing when you use them. Have the commits you're changing already been pushed or pulled out somewhere else? If so, will you or Git be able to resolve the differences if you start messing with the history? The easiest thing is to only change the history if the commit hasn't been propagated to another repository!

Amending Commits

The first command is a relatively simple one. Run:

git commit --amend -m "I decided to change this message"

Now "git log" will show that your most recent commit on the current branch has a different message. That's just the tip of the iceberg, though. You can amend a commit in any way. Modify the file foo again, stage the change, and do another amend commit:

echo "Another change to this file" >> foo
git add foo
git commit --amend -m "Foo needed to change again in this commit"

You can change your most recent commit in any way by simply staging some changes and using this command.
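One thing worth seeing for yourself is that an amend replaces the commit rather than adding a new one on top. This sketch compares the hash before and after. The file contents, messages, and identity settings are assumptions for the demo.

```shell
set -e
repo=$(mktemp -d)                          # throwaway sandbox directory (assumption)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity (assumption)
git config user.email "user@example.com"

echo "one" > foo
git add foo
git commit -q -m "Original message"
before=$(git rev-parse HEAD)

echo "two" >> foo
git add foo
git commit -q --amend -m "Amended message"
after=$(git rev-parse HEAD)

count=$(git rev-list --count HEAD)         # still a single commit
echo "$before -> $after ($count commit)"
git log --oneline
```

The hash changes because the amended commit is a new commit that takes the old one's place.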

Undoing Commits with Reset

A more powerful history-altering command is "git reset". There are several options you can give this that do subtly different things. Let's look at the default form first. Run:

git reset HEAD^

The "HEAD^" means one commit before HEAD. This is a quick way of referring to commits rather than using "git log" every time to look up the hash. The carets stack, so "HEAD^^^^" would be four commits before HEAD, which can also be expressed as "HEAD~4".

Back to the reset: when you reset to some commit, you're making that commit the latest on the current branch. Any commits after it go away, but the changes they made aren't lost. All the changes from those dropped commits end up in your working tree, so "git status" will show you that foo is modified, and "git diff" will show that the change is the string that you most recently added to the file and committed.

Alternate forms of this, which we won't try here, are with a --hard and a --soft option (the default option is --mixed). With --soft, you get the same behavior as the default, except that the changes from the dropped commits are left staged in the index instead of sitting unstaged in your working tree. With --hard, you drop back to some commit, and all of the changes after it are discarded from both the index and your working tree. Use with caution. This is the Git equivalent of "rm -rf *". Note that the default target of reset is HEAD, meaning that "git reset --hard" will just erase all the uncommitted changes you've made to tracked files since your last commit (untracked files are left alone). This can be useful at times.
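If you'd rather try the three modes somewhere safe, here's a sketch that runs each one in a scratch repository and captures the short "git status --porcelain" output afterward. In that output the first column is the index and the second is the working tree. The file contents and identity settings are assumptions for the demo.

```shell
set -e
repo=$(mktemp -d)                          # throwaway sandbox directory (assumption)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity (assumption)
git config user.email "user@example.com"

echo a > foo
git add foo
git commit -q -m "Commit a"
for line in b c d; do
  echo "$line" >> foo
  git commit -q -am "Commit $line"
done

git reset -q --soft HEAD^                  # drop the last commit, keep its change staged
soft=$(git status --porcelain)             # "M  foo": modified in the index

git commit -q -m "Commit d again"
git reset -q HEAD^                         # default (--mixed): change is left unstaged
mixed=$(git status --porcelain)            # " M foo": modified in the working tree

git reset -q --hard HEAD                   # throw away the uncommitted change
hard=$(git status --porcelain)             # empty: nothing left to commit

printf 'soft: [%s]\nmixed: [%s]\nhard: [%s]\n' "$soft" "$mixed" "$hard"
```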

Totally Thrashing Commits with Rebase

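Note: Interactive rebase relies on being able to edit text, so Git needs to know which editor to launch. It uses your core.editor configuration option if you've set one and otherwise falls back to the VISUAL or EDITOR environment variables and finally to vi. See post two from this series for configuration stuff.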

If you're only familiar with SVN, this one will blow your mind. Use "git log" to get the hash of your first commit, and run:

git rebase -i <hash>

In your text editor, you'll be presented with a list of commits and some basic instructions. The commits are all of the ones from the one after the commit you specified to the current commit--that's <selected commit> + 1 through HEAD, inclusive. By moving the commit lines around, you'll change the order of the commits in this branch. If you decide a commit is trash, delete the line, and when the rebase completes, the commit will be gone. Git does this by removing all of the commits you selected from the branch and then reapplying them in the order you specify.

In addition, you can specify a couple of other things. First, you can choose to amend any of the commits as they're reapplied by simply changing the "pick" before that commit to "e" or "edit". Remember when I said that "git commit --amend" only applies to the most recent commit? (I did say that, right? Back in post three, I think?) Well, that's technically true, but it doesn't mean you couldn't use rebase to go back ten commits to do an amend.

Finally, there's the squash option. "Squash" means compress two or more commits into one. Try this:

echo 1 >> foo
git commit -am "squashme"
echo 2 >> foo
git commit -am "squashme"
echo 3 >> foo
git commit -am "squashme"
git rebase -i HEAD^^^

Note: You'll notice that I used a new flag on commit that let me skip the "git add". Using -a with commit tells it to automatically add all modifications before committing. It only picks up changes to tracked files. The -a flag won't add any new files.

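Now, let's squash the commits we made. Leave the first commit as "pick", and change "pick" to "squash" or "s" on the other two, since a squashed commit is folded into the commit listed above it and the first one has nothing above it to fold into. Save and exit. You'll see the rebase happening, and at the end, you'll be prompted for a new commit message.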
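You can also run a squash without opening an editor at all, which is handy for experimenting. This is only a sketch. It uses the GIT_SEQUENCE_EDITOR environment variable to hand the todo list to sed, and it sets GIT_EDITOR to "true" so the default combined commit message is accepted as-is. Git needs an earlier commit to squash into, so the sed expression leaves the first line as "pick". The sed invocation with -i.bak, the commit messages, and the identity settings are my own assumptions.

```shell
set -e
repo=$(mktemp -d)                          # throwaway sandbox directory (assumption)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity (assumption)
git config user.email "user@example.com"

echo base > foo
git add foo
git commit -q -m "Base commit"
for n in 1 2 3; do
  echo "$n" >> foo
  git commit -q -am "squashme $n"
done

# Keep "pick" on line 1 of the todo list and turn every later "pick" into "squash".
# GIT_EDITOR=true accepts the combined commit message without changes.
GIT_SEQUENCE_EDITOR="sed -i.bak -e '2,\$s/^pick/squash/'" \
GIT_EDITOR=true \
git rebase -q -i HEAD~3

count=$(git rev-list --count HEAD)         # base commit plus one squashed commit
git log --oneline
```

The file should still end with the lines 1, 2, and 3, and only the history is more compact.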

This interactive rebase (i=interactive) really points to how Git differs from centralized version control with respect to its attitude toward history. In Git, history is completely mutable until it gets too complicated to control by being replicated to other repositories. As long as it only exists in your own repo, you can change practically anything about it. Git views commits as part of the development process, and therefore recognizes them as things that might need to be cleaned up a bit after the fact. In SVN, by contrast, the general attitude is that once it's in the repository, it's set in adamantium.

Note: If you use "git stash" here, be aware that current versions of Git provide both "pop", which applies the stashed changes and drops the stash entry, and "apply", which applies them but keeps the entry. Check "git help stash" to see the commands available in your version.

In the next--and hopefully last--post that deals with using Git locally, I'll show you how to navigate around your repository, and find out things you want to know about your files and history.


The Journey to Git, Part V--Merging

This post is one of a series on Git. Previously I posted on branching. When you create a branch, you diverge two lines of development. You need a way to join them back up later on, and that's what this post covers.

Before starting this, make sure you've read my other Git posts leading up to it or are comfortable working with branches in Git. This post also assumes you've been following along with the commands in the other posts. If not, you should create a repository with master and put a file named "foo" with a couple lines of text in branch master. I highly recommend that you follow along with the commands so you can see Git working.


Now let's get right to it.

Merging Branches

Create a branch from master and move onto it with:

git checkout -b mergeme

Create a file named "baz" with some text in it, and commit it on this new branch:

echo "File on branch mergeme" >> baz
git add baz
git commit -m "Added baz on branch mergeme"

Suppose that's all the work there is to do on mergeme, and we want to merge the final result into our master branch. Just switch to master and merge the branch into it:

git checkout master
git merge mergeme

You'll see output from the merge like:

Updating 645663b..bb64c2c
Fast forward
 baz |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)
 create mode 100644 baz

The first line indicates the two commits participating in the merge with the starting location first and the merge target second. The next line says "Fast forward" because Git didn't actually have to merge anything in order to make this happen. Since no commits happened on master between the time you created mergeme and the time you merged it back in, Git was able to simply "fast forward" master, meaning it just moved master's branch pointer forward to the latest commit on mergeme without creating any new commit. Had you made another commit on master before merging mergeme back, Git would have actually had to merge the commits together to create a new commit, which brings us to our next topic: any time an actual merge is performed, you have a chance of entering the lovely land of merge conflicts.
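A fast forward is easy to verify: afterward, master and mergeme should resolve to the very same commit, and no extra commit should have appeared. Here's a sketch in a scratch repository. The identity settings and the initial content of foo are assumptions for the demo.

```shell
set -e
repo=$(mktemp -d)                          # throwaway sandbox directory (assumption)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master    # make sure the first branch is named master
git config user.name "Example User"        # placeholder identity (assumption)
git config user.email "user@example.com"

echo "First line" > foo
git add foo
git commit -q -m "Added foo on master"

git checkout -q -b mergeme
echo "File on branch mergeme" >> baz
git add baz
git commit -q -m "Added baz on branch mergeme"
branch_tip=$(git rev-parse mergeme)

git checkout -q master
before=$(git rev-parse master)
git merge -q mergeme
after=$(git rev-parse master)

total=$(git rev-list --count master)       # two commits, no merge commit added
echo "master moved from $before to $after"
```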

Resolving Conflicts

Merge conflicts aren't much different in Git than in any other version control. Let's create one by appending a different line to foo in each of the branches and attempting to merge them:

git checkout mergeme
echo "Conflicting line in mergeme" >> foo
git commit -am "Conflict in mergeme"
git checkout master
echo "Conflicting line in master" >> foo
git commit -am "Conflict in master"
git merge mergeme

If it broke as expected, the merge will have produced output like:

Auto-merging foo
CONFLICT (content): Merge conflict in foo
Automatic merge failed; fix conflicts and then commit the result.

When a merge is successful, the result is automatically committed. However, when there's a conflict, the results of the merge stay in your index and/or working tree. Run "git status" and examine the output. It notifies you of files needing to be merged manually, both at the top of the status output and by marking them as "unmerged". You simply need to resolve the conflicts either manually or with your tool of choice, and make sure the file gets added, then commit the final result yourself. If you've set up your own merge tool--see post two in this series--you can fix the conflict with:

git mergetool

Otherwise, just adjust the content of foo to remove the conflict markers, and add it:

git add foo

Either way, all changes should be staged, and you can now:

git commit -m "I merged this myself"

This merge resulted in a new commit in master because it's a real merge that entailed modifications, whereas the fast-forward we did before essentially just pointed master at a commit that already existed on the other branch.
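Here's the whole conflict round trip as one sketch, from creating the conflict to committing the resolution. The resolution I chose (keeping both lines) is just one possibility, and the initial file content and identity settings are assumptions. At the end it prints the parents of the merge commit, of which there should be two.

```shell
set -e
repo=$(mktemp -d)                          # throwaway sandbox directory (assumption)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master    # make sure the first branch is named master
git config user.name "Example User"        # placeholder identity (assumption)
git config user.email "user@example.com"

echo "First line" > foo
git add foo
git commit -q -m "Added foo on master"

git checkout -q -b mergeme
echo "Conflicting line in mergeme" >> foo
git commit -q -am "Conflict in mergeme"
git checkout -q master
echo "Conflicting line in master" >> foo
git commit -q -am "Conflict in master"

# The merge is expected to fail, so test its exit status instead of letting set -e abort
if git merge mergeme >/dev/null 2>&1; then
  merge_result="clean"
else
  merge_result="conflict"
fi
unmerged=$(git diff --name-only --diff-filter=U)   # files still needing a manual merge

# Resolve by hand: keep both lines and drop the conflict markers
printf 'First line\nConflicting line in master\nConflicting line in mergeme\n' > foo
git add foo
git commit -q -m "I merged this myself"

parents=$(git log -1 --format=%P)          # a real merge commit has two parents
echo "$merge_result in: $unmerged"
echo "parents: $parents"
```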

That covers the basics of merging. I have a feeling I missed something, so this post may change, but for now, let's move on to the next post, where I'll show you how to rewrite history with Git. It's pretty cool: Part VI--Rewriting History


The Journey to Git, Part IV--Branching

Git post 4: Branching. Before you read this post, you should be familiar enough with Git to be comfortable creating a repository and adding and committing files to it. The previous posts in this series will bring you up to that point if you aren't.

Note: This post assumes that you followed along with the commands given in the previous post and have a version-controlled directory with a couple of commits in it. If you don't have that, then create a new git repository in an empty directory, and create a file named "foo" with one line of text in it and commit it. That will get you to where you need to be to start this walkthrough.


I'm going to assume that you're already familiar with the concept of a branch in a VCS. What you need to know is that Git branches are first-class citizens. They're very lightweight and easy to create and work with. With Git, it's not a question of if you'll branch, but of how many branches you'll have going at once.

Creating Branches

Let's create a branch for some concurrent development:

git branch mybranch

That produces no output if it's successful, but it creates a new branch named "mybranch" starting from your current location: the most recent commit of your current branch. View your current branches with:

git branch

It should show the following, indicating you have two branches and are currently on branch master:

* master
  mybranch

Now switch to the branch you just created:

git checkout mybranch

Note: You can both create and move to a branch at the same time with: "git checkout -b mybranch"

In a previous post, you used "git checkout" to pick out a specific commit and make that commit your working tree. It does essentially the same thing when you give it a branch name as an argument. In another post, I'll cover checkout a bit more thoroughly. For now, let's create a new file on this branch:

echo "Branch file" >> bar
git add bar
git commit -m "Added bar on mybranch"

Now "git status" should show that you're on branch mybranch and have nothing to commit, while "git log" will show all three of the commits you've made so far. Move back to master with:

git checkout master

Now bar is gone because it only exists on mybranch, and you'll see that "git log" doesn't show the commit that added it. That's because it shows the commit history of the current branch.
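If you're starting from scratch, this sketch runs the whole branch walkthrough in a throwaway repository and captures what "git log" reports on each branch. The identity settings and the initial content of foo are assumptions for the demo.

```shell
set -e
repo=$(mktemp -d)                          # throwaway sandbox directory (assumption)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master    # make sure the first branch is named master
git config user.name "Example User"        # placeholder identity (assumption)
git config user.email "user@example.com"

echo "Hello World" > foo
git add foo
git commit -q -m "My first commit"

git branch mybranch
git checkout -q mybranch
echo "Branch file" >> bar
git add bar
git commit -q -m "Added bar on mybranch"
on_branch=$(git log --format=%s)           # includes "Added bar on mybranch"

git checkout -q master
on_master=$(git log --format=%s)           # does not include it
ls                                         # bar is not in the working tree on master

echo "mybranch: $on_branch"
echo "master:   $on_master"
```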

Renaming Branches

This will be a short section:

git branch -m mybranch myotherbranch

That's it. The branch is renamed.

Deleting Branches

Deleting branches is almost as simple. Try this:

git branch -d myotherbranch

It should give you a message like:

error: The branch 'myotherbranch' is not an ancestor of your current HEAD.
If you are sure you want to delete it, run 'git branch -D myotherbranch'.

The keyword "HEAD" refers to the most recent commit of the branch you've checked out. What it's telling you is that you have commits in myotherbranch that aren't in your current branch. If that weren't the case, the delete would have succeeded. Git is very flexible, but it tries to protect you from losing things accidentally. Do what the error said and use the flag that forces a delete:

git branch -D myotherbranch

You'll see a confirmation message. That branch--along with the changes on it--is gone. Make a habit of using the -d form, and you're less likely to accidentally lose things by deleting branches that have changes that don't exist elsewhere.
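The difference between the two flags shows up in the exit status, too, which is useful if you ever script branch cleanup. In this sketch the safe delete is expected to be refused because the branch has a commit that master doesn't. The file names and identity settings are assumptions, and the exact wording of Git's error message may differ from the one quoted above depending on your version.

```shell
set -e
repo=$(mktemp -d)                          # throwaway sandbox directory (assumption)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master    # make sure the first branch is named master
git config user.name "Example User"        # placeholder identity (assumption)
git config user.email "user@example.com"

echo "Hello World" > foo
git add foo
git commit -q -m "My first commit"

git checkout -q -b myotherbranch
echo "Branch file" >> bar
git add bar
git commit -q -m "Added bar on myotherbranch"
git checkout -q master

# -d refuses because myotherbranch has a commit that master does not
if git branch -d myotherbranch 2>/dev/null; then
  safe_delete="deleted"
else
  safe_delete="refused"
fi

git branch -D myotherbranch                # force the delete anyway
echo "safe delete was $safe_delete"
git branch                                 # only master remains
```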

That's really all there is to branching in Git. Like I said, it's very easy to work with branches in Git. Stick around for more branch goodness in the form of merging in my next post: Part V--Merging


The Journey to Git, Part III--The Basics

This post: the most basic commands for interacting with a Git repository. By the end of this post, you should be able to use Git to track basic history for a project. Once again, I strongly encourage you to follow along with the commands in your own environment to help you learn.


In Git, if you don't have a repository, you've got nothing, so let's make a directory and put it under version control using Git. Then we can run some commands against it.

Creating a Repository

Make a directory named firstgitproject and change to it. Run:

git init

You'll see a message like, "Initialized empty Git repository in <path to project>/firstgitproject/.git", and if you look in your project directory, you'll see that there is, indeed, a .git folder. This folder contains the repository you just created along with its configuration.

Status of a New Repo

Let's see what Git has to say about your brand new project with:

git status

You should see three things: that you're "On branch master", that this is the "Initial commit", and that you have "nothing to commit". The "master" branch is always the branch you start off in. In some ways, it's similar to SVN's "trunk", but unlike in SVN, there's nothing special about it. You can rename or delete master with no problem. It's just a branch like any other you might create. The "Initial commit" is a little mysterious in Git. It signifies that you have no history to the project yet. Many Git commands will exit with errors when run against this "Initial commit"--i.e. until you make your first commit to the repo, so let's work on that.

Adding Files/The First Commit

Create File--Working Tree

The directory that you're in, where the .git directory is and where all your files for this project would ultimately go, is what Git refers to as your working tree. It's where you do all your work. Let's add a file to your working tree:

echo "Hello World" > foo

Run "git status" again. Now you have an "untracked" file named foo. Untracked means what it sounds like: Git isn't tracking this file. It exists only in your working tree. Let's change that.

Stage Change--Index

Tell Git to track your newly created file with:

git add foo

Another "git status" now shows foo under "Changes to be committed". What you just did, in Gitspeak, is called "staging a change". You took something that has changed in your working tree (a "change") and told Git to include it in the next commit ("staged it") using the add command. Git has a name for this area where you stage your commits to: the index. I don't know the rationale behind the naming, but when you see the documentation refer to the "index", this is what it's talking about.

Commit Change--Repository

The final step is to commit, which takes all the changes in your index and moves them into the repository. You should be familiar with the concept of a repository. It's where the full history of your project is kept. The general workflow, which we just simulated, is this: you have a working tree and index that are in sync with the repository--you've changed no files and staged no changes. You make changes, and now you have a dirty working tree. You stage some or all of the changes to the index, and now you have a clean working tree and dirty index. Then you commit the staged changes to the repository, and you're back to a clean state. Repeat.

So let's finish the cycle:

git commit -m "My first commit"

Note: If you've set the core.editor configuration option, you can omit the "-m <message>" and your editor will be opened, showing you the complete status message. Just type your commit message, save, and exit to complete the commit.

The response of this command should look something like:

[master (root-commit)]: created b1468b4: "My first commit"
 1 files changed, 1 insertions(+), 0 deletions(-)
 create mode 100644 foo

This indicates several things that I'm going to explain only briefly here. First, you made this commit on branch "master". This is the root-commit, meaning the very first in the repository--the one that went on top of the "Initial commit". A commit identifiable by "b1468b4" was created with the given message. In this commit, one file was changed, a single line was inserted, and no lines were deleted. Finally, this commit included the creation of a new file named "foo" with mode 100644 (that's the Linux file mode that indicates file permissions for you Windows guys).

At this point, some people may be surprised at the fact that we just made a commit without having any kind of Git server set up. Recall that Git is a Distributed VCS. You have an entire copy of the repository, and it lives in that .git directory, remember? So all that's involved in a commit is some local file operations. There's no need for any kind of client/server setup. Now back to our regularly scheduled programming.

Modifying Files Under Version Control

Make Change--Working Tree

The file "foo" is now under version control. Let's make a change to it:

echo "Line 2" >> foo

Now a "git status" will indicate that foo is "Changed but not updated" (newer versions of Git word this as "Changes not staged for commit"). This message is a little ambiguous. What it means is that the file is under version control, but the copy of the file in your working tree differs from that in the repository, and you haven't staged this change yet. It's time to look at a massively useful command, which you're surely familiar with from other VCSs:

git diff

This is probably the most common use of diff, though there are a host of other options available for it. I'll cover some of those later on.

You've now seen all three parts of the "git status" output. To recap, the output contains 1) the untracked files which aren't under version control, 2) files under version control that you've made changes to, and 3) changes--either new files or modifications to version controlled files--that you've staged and will be included in the next commit.

You'll also see in this latest status output that it prompts you with a couple of ways to deal with this file. You can either "add" it--we'll do this in a moment--or "checkout" it. In SVN, you're used to "checkout" being a very rare command used only for the initial retrieval of a project from a repository. In Git, it's a much more common command which means, roughly, "get X from the repository and put it in my working tree". If you were to "git checkout foo", then foo would be replaced by the version in the repository, removing your most recent changes. In this way, it acts like SVN's revert command. Let's not do that now.
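If you're curious what that would look like, here's a sketch in a separate throwaway repository, so your real foo is left alone. The scratch directory and placeholder identity are my own assumptions, and I've written the command as "git checkout -- foo"; the "--" is just a guard that tells Git "foo" is a path and not a branch name.

```shell
# Scratch repository; your real project is not involved
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git config user.name "Example User"     # placeholder identity, stored in this repo only
git config user.email "user@example.com"

echo "Line 1" > foo
git add foo
git commit -q -m "Add foo"

echo "Line 2" >> foo                    # dirty the working tree
git checkout -- foo                     # overwrite foo with the committed version (nothing is staged here)
restored=$(cat foo)
echo "$restored"                        # only "Line 1" is left
```

The uncommitted "Line 2" is gone for good after this, which is why it's worth a "git diff" before reverting anything.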

Stage Change--Index

Instead, stage your change with:

git add foo

This is something that regularly trips up SVN users trying out Git. In SVN, you only add newly created files. In Git, you "add" every change. It's actually a more consistent approach and very logical when you get used to the idea. Remember that "add" always stages a change--moves it into the index--and it's the index that gets committed. Another "git status" at this point will show you a similar output to before, when you had staged the newly created file, showing that a new file and a modified file are both just considered "changes" that have to be incorporated.
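One consequence worth seeing for yourself: "add" stages the file as it is at that moment, so edits made afterward are not part of what gets committed until you "add" again. Here's a small sketch in a throwaway repository (the scratch directory and placeholder identity are assumptions for illustration):

```shell
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git config user.name "Example User"     # placeholder identity, stored in this repo only
git config user.email "user@example.com"

echo "Line 1" > foo
git add foo
git commit -q -m "My first commit"

echo "Line 2" >> foo
git add foo                             # stages foo as it is right now (with Line 2)
echo "Line 3" >> foo                    # a later edit that has not been staged

staged=$(git diff --staged)             # index vs. last commit
unstaged=$(git diff)                    # working tree vs. index
printf '%s\n' "--- staged ---" "$staged" "--- not staged ---" "$unstaged"
```

The staged diff adds only "Line 2", and the unstaged diff adds only "Line 3". A commit at this point would record "Line 2" and leave "Line 3" sitting in the working tree.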

Commit Change--Repository

Go ahead and commit now with:

git commit -m "My second commit"

Take note of the difference between the output from this commit and the previous one. It's somewhat briefer since this isn't the root-commit and there wasn't a new file created.

Seeing Your History

Now look at your commit history with:

git log

This brings up a really handy feature of Git. If the output of a command like "git log" or "git diff" exceeds what will fit on one screen, Git automatically pipes the output through a pager (less, by default), making it easily browsable. Once your history grows beyond a few commits, you'll see this behavior regularly with "git log".

In the commit log, you'll see for each commit 1) the unique identifier of the commit--a 40-hex-digit SHA-1 hash, 2) the author of the commit, 3) the timestamp of the commit, and 4) the commit message entered for the commit.
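To pull out just those four pieces, "git log" accepts a format string. The sketch below is an illustration with assumptions of mine: a throwaway repository, a placeholder identity, and labels I made up. The placeholders themselves (%H for the full hash, %an and %ae for author name and email, %ad for the date, %s for the message subject) are standard, and "--no-pager" simply turns off the paging described above.

```shell
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git config user.name "Example User"     # placeholder identity, stored in this repo only
git config user.email "user@example.com"

echo "Line 1" > foo
git add foo
git commit -q -m "My first commit"

# One labeled line per field for the most recent commit (-1)
summary=$(git --no-pager log -1 --format='id:      %H%nauthor:  %an <%ae>%ndate:    %ad%nmessage: %s')
printf '%s\n' "$summary"

commit_id=$(git rev-parse HEAD)         # the same hash, on its own
```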

Removing Files from Version Control

Now for one more, trivial command:

git rm foo
git commit -m "I removed foo"

That's pretty self-explanatory, I think. It's what you do to delete files. Go ahead and add foo back, so that we can use it some more:

echo "new foo" >> foo
git add foo
git commit -m "Added foo back"

View Previous Versions

Establishing a history of things isn't very useful if you can't look back through the history. Here's the basic form of two commands to let you get a feel for a simple history. Later on, I'll do a whole post on navigating around a repository and seeing what's in it.

Show a Previous Version

Use "git log" to get the hash of a previous commit, and run:

git show <hash>:foo

That will show you the content of the file foo at that commit. Now use "git log" to pick two different commits, and run:

git diff <older hash> <newer hash>

You'll see the differences between the two versions in patch format. Added lines have a '+' in front, and deleted lines have a '-'.
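Here are both commands end to end in a throwaway repository. It's only a sketch: the scratch directory and placeholder identity are assumptions, and I'm capturing the hashes with "git rev-parse HEAD" rather than copying them out of "git log" so the example can run unattended.

```shell
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git config user.name "Example User"     # placeholder identity, stored in this repo only
git config user.email "user@example.com"

echo "Line 1" > foo
git add foo
git commit -q -m "My first commit"
older=$(git rev-parse HEAD)             # hash of the first commit

echo "Line 2" >> foo
git add foo
git commit -q -m "My second commit"
newer=$(git rev-parse HEAD)             # hash of the second commit

old_content=$(git show "$older:foo")    # foo as it was at the first commit
forward=$(git diff "$older" "$newer")   # older first: the new line shows up as an addition
backward=$(git diff "$newer" "$older")  # reversed: the same line shows up as a deletion
printf '%s\n' "$old_content" "$forward"
```

Swapping the order of the two hashes flips every '+' and '-', which is an easy way to confuse yourself, so older first is a good habit.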

I'm going to break this post here, though there's lots more to cover. By now, however, you know all you need in order to keep a simple, forward-moving history of any personal project you happen to be running. It's also handy for versioning configuration files if you run, for instance, a Squid proxy server or other local service with complex, text-based configuration. The next couple of posts will primarily cover branching and merging, where Git really shines in comparison to centralized version control.

First, branching: Part IV--Branching


Monday, July 13, 2009

The Journey to Git, Part II—Git Started

This second post in my Git series will just cover setting up Git so that I can keep these things fairly short and well-organized. From here on, everything is very hands-on. If you're serious about learning Git, I highly recommend you follow along with all of the steps to reinforce what you're reading. This is all going to be a command line walkthrough, though Git ships with some nice GUI tools as well. Once you're familiar with the command line and the concepts, the GUI tools should be easy to figure out.

Note: This is something of a rough draft right now without many links for reference. After posting it initially and seeing how long it was, I decided to come back and break it up in order to not scare off readers with short attention spans. You know who you are.

Note: For extensive information about any git command, you can use "git help <command>" to see the man page for it. Alternately, just google "git <command>", and the first link will probably always be the kernel.org HTML version of the man page.

On with the show...

Installing

First, obviously, install Git. On Linux, it's a simple package installation. On Windows, you can use either msysgit or install the Git package in Cygwin. I've used msysgit only in passing. Most of my experience is evenly split between Cygwin and Ubuntu. The only other thing I'll say here is you do not want msysgit and Cygwin installed at the same time (which is why I didn't use msysgit much). The two are not compatible, and you'll likely see some odd behavior in Cygwin if you do it (like scp thinking your home directory is "c:\program file\msysgit"). If you need further details on installing Git, email me, and I'll try to help.

If Git is properly installed, then you should be able to run:

git --version

Configuring

There's some configuration we can get out of the way up front. This is all optional, in fact, but at least the first one is certainly a good idea. Your configuration tool in Git is "git config". By default, this command updates the current repository's configuration, but with the "--global" flag, it sets options for your user account across all of your repositories, so we can set up some preferences before we even have a repo. For more info about the command and a full list of configuration options, see the man page for "git config".
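To see the two levels side by side without disturbing your real settings, here's a sandboxed sketch. Everything in it is an assumption for illustration: it runs in a subshell with HOME pointed at a scratch directory, and "Global Name" and "Repo Name" are made-up values.

```shell
demo_config_scopes() (
    # Subshell: the HOME override below disappears when the function returns
    HOME=$(mktemp -d)
    export HOME
    unset XDG_CONFIG_HOME                          # keep Git from looking for a user config elsewhere

    git config --global user.name "Global Name"    # per-user setting, lives under $HOME

    repo=$(mktemp -d)
    cd "$repo"
    git init -q
    git config user.name "Repo Name"               # no flag: stored in this repo's .git/config

    echo "global: $(git config --global user.name)"
    echo "effective in repo: $(git config user.name)"
)
demo_config_scopes
```

Inside the repository, the repository-level value wins over the global one, which is handy if you commit to one project under a different email than everything else.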

Identify Yourself

Let Git know who you are with:

git config --global user.name "your name"
git config --global user.email "your email"

This name and email are included in the information about every commit that you make.

Pick an Editor

There are a few things that Git needs to open a text editor for, like typing commit messages if you don't want to type them in-line. Give it the path to your preferred editor with:

git config --global core.editor <path>

Pick a Merge Tool

For some reason, I've never gotten into graphical merge tools, preferring instead to just view the raw conflict markers, but I know some people wouldn't even glance at a VCS that didn't offer the option of merge GUIs. Git knows about several Linux merge tools, and if you want to use one of those, you can just use:

git config --global merge.tool <tool>

Built-in merge tools are listed in the man page. If you want to use something other than the built-ins, which you probably will if you're on Windows, it's a bit more work. Assuming you're using WinMerge:

git config --global merge.tool winmerge
git config --global mergetool.winmerge.cmd "WinMergeU \$MERGED"

If WinMerge isn't on your path, you should be able to specify the path to it with:

git config --global mergetool.winmerge.path <path>

However, I couldn't get that to work. It's fine if you just leave off the path option and add WinMerge's directory to your system path.

For Windows

If you're using one of the Windows Git flavors, you'll very likely want to set these options to handle line terminators more gracefully:

git config --global core.autocrlf true
git config --global core.safecrlf true

And if you're using msysgit, you may need:

git config --global core.fileMode false

I had trouble that this option fixed: it tells Git to ignore differences in file modes (the executable bit).

There are tons more configuration options documented in the man page, but these should get you started. In the next post, we'll finally get to basic, day-to-day commands that will let you start using Git productively: Part III--The Basics


The Journey to Git, Part I—Distributed vs Central VCS

This one has been a long time coming. I've been using Git as a client to Subversion at work now for a number of months. Over the weekend, I attended a session on Git at the NFJS conference, and it served to solidify some of the concepts in my mind. I think I'm ready to do one or more posts on the subject now.

This first post is going to be more an examination of distributed vs. centralized VCS than anything specific to Git itself. The remainder of the posts in this series of currently unknown length will be a quick start tutorial to get you up and running, teaching you how to use Git at a very basic level with just a single, local Git repository. When I was trying to get going with Git, I couldn't find a good reference online to get me up and running in the way I needed, so I'm going to try to make this as complete as possible. I assume zero Git knowledge here, but I do assume you're familiar with version control in general.

Why Git?

First, I'm going to examine why we even need to care about a new VCS. What does Git have that SVN doesn't? That really comes down to the fact that they're two different beasts. It's not so much Git vs. SVN, but distributed VCS (DVCS) vs centralized VCS.

Centralized VCS

With centralized version control, a development team is locked into a single workflow. You have a local copy of the project (or more than one). You can make changes to your heart's content, but everything is strictly local until and unless you commit back to the repository. Then whatever you did is visible to everyone. There's no in-between here, no compromise. I actually find it quite remarkable that developers the world over have been content with this model for so long. There are a few very immediate problems that come up.

Problems with Central Control

#1: Collaboration

Scenario: You're working on adding a new feature to an application and you want another developer to give you a hand with part of it or look over some code you think is questionable, or maybe he's just working on something else that interacts with your piece. Your code needs to be available for the other guy to work with. With a centralized VCS, your only options are to 1) commit your code, making it visible to everyone else, or 2) share it some other way outside of the VCS. Option 2) is just messy, if workable. Option 1) would probably be preferred for the most part. The problem here is that your code could be in very early development, possibly even broken. We all know there's a horrible taboo on committing broken code, and rightly so with central control. So you're stuck with not being able to share or with having to stop work to make things non-broken to the point that you can safely commit before you can collaborate on a piece of code.

#2: Releases

This problem is significantly more insidious. Any time you commit something to a centralized VCS, you're essentially affirming that this feature will absolutely be included in the next release. This assumes, of course, that you're not doing work on your own branch specific either to your feature or yourself--a whole other worm can with centralized version control. As your release process becomes more streamlined and automated (that's happening, right?) and your release cycles get shorter because you're aiming to be a more agile--that's agile rather than necessarily Agile--team (that's happening, too, right?), it becomes more important to not schedule features for a release, but to schedule a release for a date and put in whatever features are finished on that date. This means it's imperative that unfinished features never be committed to your mainline development branch, like trunk, since you can't know exactly when a feature will be ready for release. Instead, features should be developed in isolation and only moved into the "trunk" branch when completely finished. This leads back to the branch-per-feature idea, which central VCS in general and SVN in particular just aren't designed for.

#3: History

This one's simple and may be less compelling to some. It's a big hangup for me. A VCS is all about keeping track of history, right? Well, who ever said history was perfect? Only a centralized VCS tries to enforce that by demanding that every commit be 100% bug-free since everyone is going to see it. This leads to larger, less frequent commits, potentially as much as days apart. Days! Why have we come to accept this as the norm? I say version control should be no more than an hour behind you at worst. Small, incremental changes are better. VCS should be a coarse-grained undo system.

Distributed VCS

How does a DVCS answer these concerns? Laying out the problems took quite a bit of space, but interestingly, they're all solved by this fundamental difference: in a DVCS, you not only have your own working copy, you in fact have your own copy of the entire repository, or at least the pieces you care about. You can change anything about it in any way, and nothing leaks out to anyone else unless 1) you push it out or 2) someone else pulls it from you intentionally. This solves all three of my problems since nobody, including a release, will see what you've done until you're ready for them to, but at the same time, your changes are available to anyone who's interested.
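Here's a rough sketch of that idea using two repositories on one machine. The names alice and bob, the file, and the commit messages are all made up, and "git clone" and "git pull" are commands this post doesn't otherwise cover. The point is only that Bob sees Alice's work when he asks for it, with no central server involved.

```shell
base=$(mktemp -d)
cd "$base"
git init -q alice                       # Alice's repository
cd "$base/alice"
git config user.name "Alice Example"    # placeholder identity, stored in this repo only
git config user.email "alice@example.com"
echo "half-finished feature" > feature.txt
git add feature.txt
git commit -q -m "Work in progress"

cd "$base"
git clone -q alice bob                  # Bob gets the entire history, not just a working copy

cd "$base/alice"
echo "more work" >> feature.txt
git commit -q -a -m "More work in progress"   # visible only in Alice's repository so far

cd "$base/bob"
git pull -q                             # Bob decides when to take Alice's new commit
git --no-pager log --oneline            # both of Alice's commits are now in Bob's history
```

Nobody else on the team, and no release build, ever saw those two work-in-progress commits unless they went looking for them in Alice's repository.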
Now that you know why a DVCS is a good idea, let's get back to Git itself. There seem to be three mainstream DVCSs out there today: Git, Mercurial, and Bazaar. Of the three, Git seems to be most widely used and gaining traction. That's not necessarily a reason to use it, but Git is what I've used, so it's what I'm writing about.
See the next post to begin a walkthrough tutorial of Git: Part II--Git Started.

Sunday, July 12, 2009

NFJS the Beginning

I just got home from the Austin, TX No Fluff Just Stuff conference. Wow! It's the first one I've been to, and it was great. For my reference and yours, I'm going to put some notes here of things that bear remembering. I expect I'll make several posts over the next few days (or more) on topics covered at the conference. Today:

Open Source Debugging Tools

One of the really good sessions I attended was on a variety of open source tools useful for debugging Java apps. It's getting to the point, or maybe it's already there, where there's almost no benefit to shelling out the cash for a commercial debugging/profiling application like YourKit. Think “tools” in a loose sense here, because some of these things are just handy CLI commands that make your debugging efforts more efficient. If you're too lazy or don't have time to read the whole thing, the "must-see"s for me in this list were OQL (in jhat or Eclipse Memory Analyzer--I don't think VisualVM supports it yet), omniscient debuggers, and BTrace.
I believe all the tools I mention are available for Windows, Linux, or Mac unless otherwise stated. Certainly the Java-based ones are. It started off with tools that aren't actually Java-oriented, but instead are for network debugging:
  • cURL – absolutely raw HTTP interaction. This guy is great for debugging any of the many protocols that use HTTP: normal web requests, AJAX, SOAP, web services, etc. You can specify any request method, parameters, body, headers, etc. Also has modes for HTTPS, FTP(S), SCP, and more.
  • Tcpdump – watch raw network traffic at the packet level. This tool shows some or all of the content of packets passing through the network interface of your computer. It's highly configurable to capture only exactly what you want.
  • Wireshark – GUI for viewing tcpdump output, formerly known as Ethereal. Wireshark can also do the captures itself. Often, though, you're working in a headless environment, in which case you'd need to use tcpdump and ship the dump file back to where you can analyze it with the excellent Wireshark interface.
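For a feel of what the first two look like in practice, here's a sketch wrapped in two small shell functions. The function names, the service URL, the interface name, and the file names are all hypothetical; the curl and tcpdump options themselves are standard ones.

```shell
# POST an XML file to a service and show the full HTTP exchange.
#   $1 = URL, $2 = file holding the request body
http_post_xml() {
    # -v prints request and response headers; --data-binary sends the file unmodified
    curl -v -X POST -H 'Content-Type: text/xml; charset=utf-8' --data-binary @"$2" "$1"
}

# Capture traffic on one TCP port to a file for later analysis in Wireshark.
#   $1 = interface, $2 = port, $3 = output file (usually needs root)
capture_port() {
    # -i selects the interface, -s 0 keeps whole packets, -w writes raw packets to a file
    tcpdump -i "$1" -s 0 -w "$3" "tcp port $2"
}

# Example calls (not run here):
#   http_post_xml http://localhost:8080/services/Example request.xml
#   capture_port eth0 8080 capture.pcap
```

The capture file written by tcpdump can be copied off a headless box and opened directly in Wireshark.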
Now to the Java tools. A “**” indicates that a tool is distributed with the JDK since version 1.5. I got tired of typing it over and over:
  • Jmeter – Apache Jmeter has come a long way. It's primarily a load testing tool that can run against a number of different types of servers that one typically encounters in Java enterprise development. The interface isn't entirely intuitive, but don't give up too quickly. It is extremely flexible and powerful.
  • Jps** – like ps in Linux, but shows Java-specific details about all running JVM processes.
  • Jstat** – shows JVM statistics for a particular JVM. This shows mostly memory- and garbage-collection-related stats.
  • Jconsole**/VisualVM – graphical tool that shows tons of statistics about a JVM as well as JMX MBeans. They can also initiate and analyze heap dumps. Jconsole is being deprecated in favor of VisualVM. While VisualVM is actually included with the JDK, it's recommended to download it separately since development is ongoing, and the version from the website will likely always be way ahead of the version in your JDK.
  • Jstatd** – a daemon that allows you to connect jps, jstat, and jconsole/VisualVM to remote machines. I.e. run jstatd on the remote machine, and then you can connect one of the other tools to it.
  • Jstack** – shows stacktraces of all threads in a running JVM.
  • Jmap** – displays/dumps content of the Java heap. A heap dump can be very large and expensive, but invaluable in diagnosing performance problems and OOMEs. Dumps are created in a standard format that can be read by a number of tools.
  • Jhat** – analyze and display—actually run an HTTP server that serves the results of—a heap dump. Jhat is quick and effective (given enough of its own heap space to run in!), but since it exposes everything as HTML, it's somewhat limited in comparison to Jconsole/VisualVM or another tool that can parse heap dumps.
  • OQL – not a tool, per se, but I only learned about it this weekend, and it's awesome! Object Query Language is a language for letting you query a heap dump to find out virtually anything you could possibly want to know in one, neat result set! Simple example: “select z from java.lang.String z where z.count > 50” – applied to a heap dump will display all String instances with more than 50 characters. I believe a number of tools, including jhat, support OQL, but I haven't researched this enough yet.
  • Jinfo – display or change JVM options while a JVM is running. For instance, turn on HeapDumpOnOutOfMemoryError on every important JVM you're currently running because it could help you immensely in diagnosing crashes.
  • Javap – class file disassembly. Given any java class file, show a human-readable representation of the bytecode. If the class was compiled without debug information, it will be somewhat less readable.
  • Eclipse Memory Analyzer – Eclipse plugin that can load (but not initiate) heap dumps and run OQL queries against them and has some quite friendly and useful visual representations of the dump.
  • BTrace – a utility that lets you easily inject your own code into a running JVM by dynamically instrumenting bytecode. The code is quite simple. You just have to write a static method and put an annotation on it telling it when to run. Note: the code you can put in a BTrace class is extremely limited. It seems to be mainly intended for creating more verbosity in an application when things aren't working quite right and your logging isn't telling you what you need to know.
  • “Omniscient” debuggers – We didn't have time to get into these, but they look crazy powerful. The idea with these is that the debugger knows everything and will let you do anything, including stepping backward through a program, to figure out where a bug came from. They do all this by recording every single thing that happens in the JVM. Thus while they're crazy powerful, they're also crazy slow, but still... A couple that were mentioned: ODB and TOD.
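Several of the JDK command-line tools above chain together naturally once you have a process ID. Here's a sketch as a shell function; the function name, the output file names, and the PID in the example call are made up, and it assumes a JDK's bin directory is on your PATH.

```shell
# Collect a quick diagnostic snapshot of one running JVM.
#   $1 = process ID, as reported by "jps -l"
jvm_snapshot() {
    pid=$1
    jstat -gcutil "$pid" 1000 5                         # memory/GC utilization: 5 samples, 1000 ms apart
    jstack "$pid" > "threads-$pid.txt"                  # stack traces of all threads
    jinfo -flag +HeapDumpOnOutOfMemoryError "$pid"      # turn the flag on in the running JVM
    jmap -dump:format=b,file="heap-$pid.hprof" "$pid"   # binary heap dump; can be large and slow
}

# Example session (not run here; 12345 is a made-up PID):
#   jps -l
#   jvm_snapshot 12345
#   jhat heap-12345.hprof
```

The resulting .hprof file is the standard dump format mentioned above, so it can go to jhat, VisualVM, or Eclipse Memory Analyzer. Keep in mind that jhat only ships with older JDKs.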
Finally, a few more non-java tools that deal with general OS stuff:
  • fs_usage (Mac), inotifywait, inotifywatch, lsof (Linux) – tools for watching changes to files and/or seeing what processes are changing files.
  • Process Monitor (Windows) – lets you see what processes are changing files and much, much more for Windows. Not open source, but free. This tool is like Task Manager on steroids.
  • FileMon (Windows) – close cousin of Process Monitor that shows lots more file-related information in Windows. Also not open source, but free.

Wednesday, July 8, 2009

How To Aggregate Downstream Test Results in Hudson

I must've googled a dozen permutations of this post's title looking for the key to my problem of... well, what it says. How on earth do I get Hudson (a top-notch CI server) to "Aggregate Downstream Test Results"??? The little help icon, which is typically very useful, provided a handy description of the feature, but didn't say anything about how to use it. My searches found several posts from other people having the same problem, but little in the way of resolution. I'm not sure if there's just not many people wanting to do this or if the "how" was just very obvious to everyone but me, but one way or the other, I finally figured it out. In retrospect, it does seem a bit obvious.
The key to getting Hudson to track test results from downstream builds is that all of the builds, upstream and down, must have a common, fingerprinted artifact. Any old artifact will do...almost. More on that later. I just had my first job--let's call it Foo--create a file with the date and time in it called "starttime".
1) Have Foo archive this file as an artifact, and either enter the file in the list of artifacts to fingerprint, or turn on fingerprinting for all artifacts.
2) In the downstream job, Bar, do exactly the same as in step one for Foo: archive and fingerprint the same file. Of course this implies that Bar has actually gotten the file in question somehow. I haven't worked out how to do this through Hudson other than by using wget to retrieve the archived one from the last build of Foo: something like "wget http://localhost:8080/job/Foo/lastBuild/artifact/starttime".
3) Of course, you'll need to have some test output being recorded by at least Bar, if not both jobs, since the whole point of this is to see aggregated test results.
4) You can always see the very latest test results at http://localhost:8080/job/Foo/lastSuccessfulBuild/aggregatedTestReport. Don't use "lastBuild" or the page will be unavailable when Foo is running or broken. When you have a lot of downstream tests, you can enable auto-refresh on this page and see the results fill in as jobs complete--very cool, especially when you have a nice test cluster with lots of downstream jobs.
5) It's handy to make Bar triggered by a build of Foo, but it's by no means required. From now on, any builds of Foo and Bar that share an identical artifact (checked by md5 checksum) are linked together, and the link appears in several places, one of which is the aggregated test results.
I emphasized the word "any" there for the same reason I said you can use almost any artifact. Remember it's the md5sum of the artifact in question that is used to determine which builds of which jobs are linked together. Say you start builds Foo #4, #5, and #6, and all create or use the same "starttime" file for some reason--maybe you're using just the date instead of date + time. Then Bar #4, #5, and #6 also all retrieve, archive, and fingerprint the same file. Since the checksum of the artifact is common in all those builds, they'll all be linked together, and Hudson can't really tell them apart. I'm not sure what this does for aggregated test results, but in other places, like where upstream and downstream builds are listed, instead of showing that Foo #4 led to Bar #4, it'll show Foo #4 led to Bar #4-#6.
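The collision is easy to reproduce outside of Hudson. In this sketch, the file names are made up and "build 4" / "build 5" stand in for whatever unique value you'd write in a real job (Hudson's BUILD_NUMBER environment variable would be one candidate). It assumes md5sum from GNU coreutils is available.

```shell
work=$(mktemp -d)
cd "$work"
today=$(date +%Y-%m-%d)

# Date only: two builds on the same day write byte-identical files
echo "$today" > foo4_dateonly
echo "$today" > foo5_dateonly

# Date plus something unique per build
echo "$today build 4" > foo4_unique
echo "$today build 5" > foo5_unique

md5sum foo4_dateonly foo5_dateonly      # same checksum twice: these builds would be linked together
md5sum foo4_unique foo5_unique          # two different checksums: the builds stay distinct
```

As long as every upstream build writes content that no other build will ever write, each downstream build links back to exactly one upstream build.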
So there ya go. Let 'er rip. I'm happily chugging away now with a 5-node Hudson cluster churning out test results like there's no tomorrow.