Book review – The Passionate Programmer

The Passionate Programmer by Chad Fowler describes various techniques to boost your career as a software developer. Chad compares your career to a product you want to sell; in this case, the product is you. You have to find the right market for yourself, which means understanding how supply and demand work in your market and how you can react to changes in it. This is best done by being a generalist and a specialist at the same time. As soon as your specialist knowledge is no longer needed, you can use your general skills to survive and find your next niche.

Further on, when you know your market, you need to create a product. As a software developer, you solve your customers' problems. By knowing your customers' business, you can understand their needs more easily, and they enjoy working with you. During the lifetime of your product, it is useful to be guided by a mentor who knows the business. It also helps to be a mentor to others, because teaching increases your own knowledge and abilities. Another way of learning new skills and improving your product is practice. The more you practice in playgrounds, the better your product gets.

Practising in playgrounds is not enough for a successful product, though. You must show that your product works in real life. Go out and get some real projects done. Concentrate on the current project and your current job, but also keep the goals of your company in mind. Leave your comfort zone to become reliable in stressful situations. This increases your own trust and the trust of others in your abilities.

Now you know how to perform in real life. This is the time to sell your product the right way. Marketing yourself is more about building a brand than just creating a product. The product must speak for itself by being remarkable. Ensure that your management knows you and understands how you perform.

In the end, you have to think about your long-term goals. You need to evolve constantly. As in agile software development, develop your career in an agile way. Regularly check your current situation and the direction you are heading. Is it still the right one?

This is a short summary of The Passionate Programmer. To me, many of the techniques sound useful. Apprenticeship Patterns by Dave Hoover and Adewale Oshineye points in the same direction, and some of the techniques found in The Passionate Programmer can be found there, too. I have started to turn some techniques from both books into habits in my daily work. Therefore, I recommend that new software developers read The Passionate Programmer. It will boost your daily productivity.

Check out the following links to read more about the topic.

Resizing VirtualBox drives

In my day-to-day work, I use a handful of virtual machines to handle different tasks, normally one for each purpose. One of those machines hosts a Confluence system containing useful information and thoughts. When I initially set up the machine, I assigned it a certain amount of disk space. A few days ago, Confluence became really slow when saving content. I checked the disk space and the partition was 100% full. As I did not want to install Confluence on a new machine with more disk space, I looked into how the disk size could be increased.

System setup

Confluence is installed on a virtual machine powered by VirtualBox. The guest system runs Ubuntu 16.04 LTS, while the host is a Windows 10 machine. The guest system uses the Logical Volume Manager (LVM). As explained on ubuntuusers.de, LVM is an abstraction layer between partitions and the file system. It allows combining several physical partitions into one volume group. Inside a volume group, logical partitions (logical volumes) can be created. This is helpful in server environments where new disks are added during the lifetime of the server: the new disks can be added to the volume group and the logical volumes can be extended.

Increase disk space

This setup results in several levels where the disk space can be configured.

  1. Size of Virtual Disk Image (VDI)
  2. Partition size of the physical partition of the virtual machine
  3. Partition size of the LVM partition
  4. Size of the logical volume
  5. Size of the file system of the logical volume

The steps described in the next sections modify the disk of your virtual machine. This might result in data loss. Therefore, take a backup of your virtual machine first.

Increase disk size of virtual disk image

The size of a virtual disk image can be increased with VirtualBox's VBoxManage command. The virtual machine has to be turned off first.

VBoxManage modifyhd YOUR_HARD_DISK.vdi --resize SIZE_IN_MB

YOUR_HARD_DISK points to the VDI file containing the disk of the virtual machine. SIZE_IN_MB specifies the new size of the virtual disk image. More on that can be found on askubuntu.com.

Increase partition size using GParted

After increasing the container file, the newly available space has to be appended to the partitions of the virtual machine. This is done using GParted and is not possible while the partition is in use. In case the virtual machine has only one partition, which is used by the operating system, it is necessary to boot from a live CD/DVD/ISO to change the partitions. GParted provides one; otherwise, an Ubuntu live ISO also contains GParted.

Boot into one of the live ISOs and start GParted. It lists all available partitions. Select the hard drive you want to change in the upper right corner. Deactivate all partitions you want to change using the context menu. Afterwards, increase the size of the physical partition and of the LVM partition using the context menu (Resize/Move). howtogeek.com shows this in a bit more detail.

Increase logical volume size

Now that the physical partitions have been enlarged, it is time to boot into the normal operating system of the virtual machine and increase the logical volume. First, it is necessary to know what the volume group is called. The pvdisplay command prints this information.

sudo pvdisplay

The output shows a PV Name – the name of the physical partition – and a VG Name – the name of the volume group – together with some other information. In this case, the volume group is ubuntu-vg.

To change logical volumes, various commands exist; ubuntuusers.de lists them. The one to increase the size of a logical volume is lvextend.

sudo lvextend -l +100%FREE /dev/ubuntu-vg/root

The parameter -l +100%FREE specifies the new size of the volume. In this case, all free space is added (the “+” sign) to the existing volume. Here, the space is added to the root volume of the ubuntu-vg volume group.

Afterwards, the size of the file system of the logical volume has to be increased. The resize2fs command is used for this.

sudo resize2fs /dev/ubuntu-vg/root

More information about this part can be found on hiroom2.com.

Conclusion

Dealing with a virtual machine running out of disk space is not a big issue. It is possible to increase the disk space after the machine has been created, even when the disk is completely full. My Confluence machine is now up again, and the new disk space sped it up.

Caching in Gradle

Setting up a CI pipeline is nowadays standard in software development. In our case, we use Jenkins as a CI server with one master and one slave. The master builds our artifacts and uploads them to a repository; the slave executes regression tests. As mentioned on Wikipedia, a regression test verifies that software which was previously developed and tested still performs correctly after it has been changed. As also mentioned in the article, regression tests can be used when a feature is redesigned, to ensure that the mistakes made in the original implementation are not repeated in the redesign. This applies quite well to my current situation.

Reducing legacy costs

In the past few weeks, I replaced our source of input data, moving from a database-based approach to a file-based one, which provides more flexibility and is also a bit faster than the old one. When developing simulation software, there is one basic rule: if you do not change the input or the implemented semantics, the output must stay the same.

So replacing the mechanism that loads the input data should not change the output. Doing this in a legacy environment means we do not have good test coverage. Therefore, to preserve the behaviour, a couple of simple regression tests are used which basically compare the simulation output of the old implementation with that of the new one. Each iteration of the regression tests takes about 55 minutes to complete, so it is possible to run it 7 to 8 times a day.

Our CI pipeline handles this for us. The code is built and uploaded to a file-based repository on our file server. Afterwards, the regression tests are triggered and Gradle uses the latest artifacts for the tests. Nothing special here; it looks like a normal CI pipeline.

Problems with the repository

During development, it happened from time to time that the regression tests failed with a NoClassDefFoundError pointing to our main class. I did not understand this at first, because the class had not been changed and it was definitely there.

Keeping an eye on that phenomenon revealed that builds during lunch always succeeded, except when there was a real bug, while builds during working hours sometimes succeeded and sometimes failed. It looked like the tests failed whenever the artifacts were being built and uploaded to the repository at the very moment one of the regression tests started.

Not all local resources are local

As mentioned earlier, Gradle is used for the build. Gradle has built-in support for caching dependencies in a local folder after downloading them from a remote repository. Gradle can also display where the dependencies are taken from during the build, see Stack Overflow.

task printDeps {
  doLast {
    println "Dependencies:"
    configurations.runtime.each { println it }
  }
}

Adding the above task to the build file and executing it shows all dependencies and the path to each of them. In my case, for most of the dependencies the path pointed to the local Gradle cache. For the artifacts located in the file-based repository, the path pointed directly to the file server instead of the local cache.

Digging deeper into this revealed that Gradle considers all file-based repositories to be local. Local repositories are not considered worth caching, so the dependency is used directly from that location, even if it points to a server. As mentioned at gradle.org, this is hard-coded into Gradle. There were also some feature requests for Ivy and Maven repositories to make this behaviour configurable. However, they sadly did not survive the migration to GitHub.

Finding our way out

This behaviour is only hard-coded for file repositories, so one solution could be to switch to a binary repository like Artifactory or Nexus. Nevertheless, this has the drawback of maintaining another server, which in our case provides little added value compared to the file server solution.

Another solution is to download and cache the dependencies manually in the build script. This can be done by adding a dedicated task which copies the file-based repository into a local folder. This always copies the whole repository, which can increase your build time and network load. One could add some caching logic, but that would just reinvent the wheel.

task syncDependencies(type: Sync) {
  group = 'build setup'
  from project.ext["mobitopp.repository.url"]
  into project.ext["local.cache.path"] as File
}
compileJava.dependsOn syncDependencies

In our case, the build time did not increase significantly, and compared to the maintenance costs of another server, this approach is easier for us to handle.

Conclusion

Watch where your build tool loads artifacts from, and make sure your builds do not affect each other.

Repeatability in software development

Compared to other engineering disciplines, software development has a great advantage in testability. We can automatically test our whole product within a short period of time and after every change we make. Comparing this to quality testing in, for example, mechanical engineering reveals that we can save a lot of time and test more often, even during development. This gives us high-performance quality assurance compared to other disciplines.

Repeatability in tests

To gain this performance, we have to write tests with certain properties. Andy Hunt and David Thomas, and in the newer edition also Jeff Langr, describe in their book Pragmatic Unit Testing the A-TRIP and FIRST properties of tests. Both sets are comparable and both contain repeatability, which provides reliable results between test runs. This is a property that is also required in simulations.

Repeatability in simulations

Given the same input and the same version, a simulation must produce the same output. In fields where simulations should cover a certain amount of uncertainty, like traffic simulations, randomness is introduced to model human decision making. The simulations are designed as a kind of Monte Carlo experiment.

As true randomness is not repeatable, pseudo-randomness is used. This means a random number generator with a specified seed is used to provide reproducible experiments. As long as the seed stays the same, the simulation produces the same output. Once the seed is changed, the simulation might produce a different output.
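
A minimal Java sketch of this idea: two generators created with the same seed produce exactly the same sequence of numbers (the seed value 42 is arbitrary).

import java.util.Random;

public class SeededRandomDemo {
  public static void main(String[] args) {
    Random first = new Random(42L);  // fixed seed
    Random second = new Random(42L); // same seed, hence the same sequence
    for (int i = 0; i < 3; i++) {
      // prints "true" three times: both generators are in lockstep
      System.out.println(first.nextDouble() == second.nextDouble());
    }
  }
}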

Using controlled random number generators is one aspect of reproducing the results of earlier experiments. Another aspect is avoiding data structures that store data in an uncontrolled order, like HashMap. A HashMap might change the order of the stored objects during a rehash. Due to this, the iteration order at different times during the execution of the program might differ. This is also mentioned in the JavaDoc comment.

This class makes no guarantees as to the order of the map; in particular, it does not guarantee that the order will remain constant over time.

Additionally, HashMap relies on the hashCode method of Object to store and distribute the objects in its internal data structure. As mentioned in the JavaDoc of hashCode, the hash code of the same object need not be the same across different executions of the same application.

This integer need not remain consistent from one execution of an application to another execution of the same application.

The first aspect does not necessarily break repeatability, as long as the elements are added to the map in the same way and the rehashing does not change between executions. The second aspect can break repeatability, but it is only relevant when the application iterates over the map. If only the lookup mechanism of the map is used, HashMap is just fine.
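
The following sketch illustrates the second aspect. The Zone class is purely hypothetical; since it does not override hashCode, the identity hash code of Object is used, so the iteration order of the map may differ from one run of the program to the next.

import java.util.HashMap;
import java.util.Map;

class Zone {
  // no hashCode/equals override: the identity hash code is used,
  // and it may differ between executions of the same application
  final String name;
  Zone(String name) { this.name = name; }
  @Override
  public String toString() { return name; }
}

public class IterationOrderDemo {
  public static void main(String[] args) {
    Map<Zone, Integer> trips = new HashMap<>();
    trips.put(new Zone("north"), 1);
    trips.put(new Zone("south"), 2);
    trips.put(new Zone("east"), 3);
    // the order of the printed zones may change between runs
    trips.keySet().forEach(System.out::println);
  }
}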

Alternatives

There are several alternatives that provide a repeatable iteration order. When using comparable keys with a natural order, one can use TreeMap, which implements SortedMap. Keys implementing Comparable are sorted based on their compareTo method or a given Comparator. As long as the comparison mechanism stays the same, the results will be repeatable.

If there is no natural order of the elements, or no order can be defined, one can use a LinkedHashMap. LinkedHashMap does not rely on comparable objects but stores entries in the order they were added. This results in repeatable simulation experiments as long as the input data is added in the same order.
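
A small sketch of both alternatives, using plain String keys for brevity: TreeMap iterates in the natural order of the keys, LinkedHashMap in insertion order, and both orders are the same on every run.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class RepeatableIterationDemo {
  public static void main(String[] args) {
    // TreeMap: iteration follows the natural (alphabetical) order of the keys
    Map<String, Integer> sorted = new TreeMap<>();
    sorted.put("south", 2);
    sorted.put("north", 1);
    sorted.put("east", 3);
    sorted.keySet().forEach(System.out::println); // east, north, south on every run

    // LinkedHashMap: iteration follows the insertion order
    Map<String, Integer> insertionOrdered = new LinkedHashMap<>();
    insertionOrdered.put("south", 2);
    insertionOrdered.put("north", 1);
    insertionOrdered.put("east", 3);
    insertionOrdered.keySet().forEach(System.out::println); // south, north, east on every run
  }
}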

Conclusion

When your application must produce the same output given the same input, think twice about which data structures you use. If you want to iterate over the entries or keys of a Map, use an implementation that provides a repeatable iteration order, like TreeMap or LinkedHashMap. The same applies to tests; otherwise, your tests may pass on your machine but fail on another one. So be sure to use the right data structure for the right task.

Readable configuration files

From time to time it may be necessary to develop software where you have to configure each run or use a predefined configuration. Normally there should be a UI which the user can use to create such a configuration and hand it over to the software. But when the effort of creating a user-friendly UI is too high, or the user does not want one, it is essential to choose a configuration format that can be edited quite easily. In my current work, I have to provide a configuration format which sets up a travel demand simulation, including all needed input data.

Taking a look at Wikipedia, you can find several possible formats which can be used to serialise the data. As I have worked with XML in the past and, like Jeff Atwood, got fed up with all those tags, I decided to take a look at alternative formats like YAML. It was designed to be easily readable, and so far I can agree with that.

As stated above, I need to define a configuration format for a travel demand simulation. Historically, our tool uses matrices for costs and travel time as input. Those matrices can differ by time and travel mode, e.g. for travelling by car you can have one matrix which is valid between 0 AM and 1 AM and another one which is valid between 1 AM and 3 AM. If you want to go by bus, there are also different matrices, e.g. one valid between 0 AM and 2 AM and one valid between 2 AM and 3 AM. So you can have several matrices per travel mode. Using YAML, one can specify the input files in a data-driven approach using nested sets of key-value pairs.

travelTime:
  - mode: car
    matrices:
      - from: 0
        to: 1
        path: path/to/0-to-1.file
      - from: 1
        to: 3
        path: path/to/1-to-3.file
  - mode: bus
    matrices:
      - from: 0
        to: 2
        path: path/to/0-to-2.file
      - from: 2
        to: 3
        path: path/to/2-to-3.file
cost:
...

This is a classical approach to specifying configuration data, similar to how it was done in XML, except that only one parameter or key-value pair is stated per line. As stated above, YAML was designed to provide a more human-readable serialisation format, so we can change the serialisation of the resulting configuration a bit further.

travelTimeUsing:
  car:
    between:
      0 to 1: path/to/0-to-1.file
      1 to 3: path/to/1-to-3.file
  bus:
    between:
      0 to 2: path/to/0-to-2.file
      2 to 3: path/to/2-to-3.file

In this case, the data structure is based on maps, where car and bus are keys for the travel modes, and 0 to 1, 1 to 3, 0 to 2 and 2 to 3 are keys for the time spans and matrix files. The keys can be combined to answer the question: where will the travel time using a car between 0 and 1 be read from?

travelTimeUsing: car: between: 0 to 1: path/to/0-to-1.file

Ignoring punctuation, this can be read as:

Travel time using car between 0 to 1 path/to/0-to-1.file.

To build a better-sounding sentence, one could also add another key, will be read from, after 0 to 1, which results in:

Travel time using car between 0 to 1 will be read from path/to/0-to-1.file.

But I think this last addition inflates the configuration with too much boilerplate text.
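
To give an impression of how such a file can be consumed in code, here is a minimal sketch. It assumes the SnakeYAML library (the post does not prescribe a parser) and a hypothetical file name config.yaml; the unchecked casts are acceptable for a sketch.

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Map;

import org.yaml.snakeyaml.Yaml;

public class ConfigReader {
  @SuppressWarnings("unchecked")
  public static void main(String[] args) throws Exception {
    try (InputStream input = new FileInputStream("config.yaml")) {
      Map<String, Object> config = new Yaml().load(input);
      // walk down the nested maps: travelTimeUsing -> car -> between -> "0 to 1"
      Map<String, Object> travelTime = (Map<String, Object>) config.get("travelTimeUsing");
      Map<String, Object> car = (Map<String, Object>) travelTime.get("car");
      Map<String, String> between = (Map<String, String>) car.get("between");
      System.out.println(between.get("0 to 1")); // prints path/to/0-to-1.file
    }
  }
}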

In my case, this serialisation format did not take a lot of time to develop, and my users were quite happy with it. They can use normal text editors to change their configurations. Some text editors additionally support them with syntax highlighting or folding of YAML files.

Using text editors to change the configuration instead of a dedicated UI or XML has the drawback that I could not find a mechanism comparable to XML Schema for YAML which is supported by editors. So you have to know the syntax of the configuration. In my case, my users could write their configuration files after looking at a small example. Nonetheless, sooner or later we may need to develop a UI for configuring simulations to improve the user experience. Until then, we have a quite readable configuration format.

Stumbling over the Liskov Substitution Principle

Today I stumbled over a problem that has been living in Java for a long time: iterating over the elements of a Java Stack is not done in the order I expected. A Stack in Java is a Vector with additional push, peek and pop methods. As a Vector is a List and a List is an Iterable, the elements of a Stack can be processed using a for-each loop. Inheriting the behaviour from Vector means that all elements are processed in the order they were added to the Stack. But this is not the order in which I expected a Stack to be iterated. The following code shows this behaviour.

Stack<String> stack = new Stack<>();
stack.push("first");
stack.push("second");
for (String element : stack) {
  System.out.println(element);
}

Output:

first
second

Taking a look at the web, especially Stack Overflow, reveals that I am not the only one expecting a different order when iterating a Stack. The Java bug tracker provides the reason for the current behaviour: the Stack class extends the Vector class, resulting in inherited methods and behaviour that cannot be deactivated. Another piece of code shows this in a more practical way.

Stack<String> stack = new Stack<>();
stack.push("first");
stack.push("second");
stack.add(1,"third");

for(String element : stack) {
  System.out.println(element);
}

Output:

first
third
second

As mentioned in the bug tracker, this behaviour violates the Liskov Substitution Principle: a Stack does not behave like a Vector, so it should not inherit from Vector. The bug tracker also mentions that this design decision was not a good one, but it has been taken and now we have to live with it. Additionally, the JavaDoc comment of the Stack class tells us to use a Deque as a more complete implementation of a stack.

Long story short, using a Deque as a Stack in Java looks like the following.

Deque<String> stack = new LinkedList<>();
stack.push("first");
stack.push("second");

for(String element : stack) {
  System.out.println(element);
}

Output:

second
first

In the future I will keep an eye on the classes I use, especially whether an implementation fits the concept it implements or not.

Java Forum Stuttgart – Part 3

This is the last post about my visit to the Java Forum Stuttgart. In Part 1 and Part 2, I described the other talks I attended at the JFS 2016. In this post, I present the remaining talks.

Erhöhe i um 1

This talk was a replacement for another one whose speaker was not able to attend the conference. Michael Wiedeking again gave an entertaining talk about comments in code, especially comments like i = i + 1; // Increase i by 1. He also discussed the difference between API documentation, like JavaDoc, and normal in-line comments. His conclusion was that instead of writing comments, one should invest the time in more readable names.

Another interesting part of the talk covered different types of interfaces. He splits interfaces into three types.

  1. unchangeable public interfaces
  2. changeable public interfaces
  3. private interfaces

Type 3 is the least problematic one. This type is only used to encapsulate different parts of our software internally. Changing a type 3 interface is like changing a normal class; it is just a refactoring, because the developer can check all usages of the interface. Type 2 interfaces are used in-house or by a small number of users who are known to the developer. A change to this kind of interface is a bit more problematic, but with good reasons it is acceptable, because only a few people have to change their software. Nonetheless, it should be avoided. Type 1 interfaces are the most problematic ones, because they are published to a wide audience and used by a lot of developers worldwide. A good example of this is the JDK. Changing type 1 interfaces, or their visibility, is nearly impossible. Every change to an interface of this type will break a huge number of builds and is therefore not acceptable.

Was jeder Java-Entwickler über Strings wissen sollte

This talk was held in the fashion of What Every Java Programmer Should Know About Floating-Point Arithmetic and revealed some interesting insights into Strings in Java. Before presenting those insights, the speaker gave a short introduction to measuring the performance of Java programs. This part is mainly based on blog posts by Antonio Goncalves, the book Java Performance by Scott Oaks, and Quality Code by Stephen Vance. Performance in Java is best measured using the Java Microbenchmark Harness (JMH), which is developed as part of the OpenJDK. It allows analyzing programs at scales down to nano- and microseconds and provides support for warming up the JIT compiler.
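
A minimal JMH sketch, just to illustrate the shape of such a benchmark (the class and the measured operation are made up here, and the jmh-core dependency plus its annotation processor are assumed to be on the classpath):

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5)      // let the JIT compiler warm up before measuring
@Measurement(iterations = 5) // then take the actual measurements
@Fork(1)
@State(Scope.Benchmark)
public class StringConcatBenchmark {

  private String left = "Java";
  private String right = "Forum";

  @Benchmark
  public String concatenate() {
    return left + right; // the operation whose average time is measured
  }
}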

After this introduction to measuring performance in Java, the presenter showed the impact of String#intern. This method moves the content of a String into the string table and only keeps a reference to it. Due to this, two Strings with the same content need the memory for the content only once, plus twice the memory for the references to it. Depending on the application, this can reduce the memory footprint significantly. If you want to analyze this, you can pass -XX:+PrintStringTableStatistics as a command line argument. Together with the G1 garbage collector (-XX:+UseG1GC), string deduplication can be activated with -XX:+UseStringDeduplication.
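
A small sketch of the effect of intern (the string values are arbitrary):

public class InternDemo {
  public static void main(String[] args) {
    String first = new String("conference");  // explicitly created object on the heap
    String second = new String("conference"); // a second object with the same content
    System.out.println(first == second);                   // false: two distinct objects
    System.out.println(first.intern() == second.intern()); // true: both resolve to the same pooled instance
  }
}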

This and that

Between the talks and on the way to and from the Java Forum, there were a lot of other interesting conversations. All in all, it was a nice experience and I will reserve the date for the next Java Forum in my calendar.