Caching in Gradle

Setting up a CI pipeline is standard practice in software development nowadays. In our case, we use Jenkins as a CI server with one master and one slave. The master builds our artifacts and uploads them to a repository, and the slave executes the regression tests. As Wikipedia puts it, a regression test verifies that software which was previously developed and tested still performs correctly after it has been changed. The article also mentions that a regression test can be used when a feature is redesigned, to ensure that the mistakes made in the original implementation are not made again in the redesign. This applies quite well to my current situation.

Reducing legacy costs

In the past few weeks, I replaced our source of input data, moving from a database-based approach to a file-based one, which provides more flexibility and is also a bit faster. When developing simulation software there is one basic rule: if you change neither the input nor the implemented semantics, the output must stay the same.

So replacing the mechanism that loads the input data should not change the output. Doing this in a legacy environment means that we do not have good test coverage. Therefore, to preserve this behaviour, we use a couple of simple regression tests, which basically compare the simulation output of the old implementation with the output of the new one. Each iteration of the regression tests takes about 55 minutes to complete, so it is possible to run them 7 to 8 times a day.

Our CI pipeline handles this for us. The code is built and uploaded to a file-based repository on our file server. Afterwards the regression tests are triggered and Gradle uses the latest artifacts for the tests. Nothing special here; it looks like a normal CI pipeline.

Problems with the repository

During development, the regression tests failed from time to time with a NoClassDefFoundError pointing to our main class. I did not understand this at first, because the class had not been changed and it was definitely there.

Keeping an eye on this phenomenon revealed that builds during lunch always succeeded, except when there was a real bug, while builds during working hours sometimes succeeded and sometimes failed. It looked like a test failed whenever it was started at the very moment the artifacts were being built and uploaded to the repository.

Not all local resources are local

As mentioned earlier, Gradle is used for the build. Gradle has built-in support for caching dependencies in a local folder after downloading them from a remote repository. Gradle can also display where the dependencies are taken from during the build, see Stack Overflow.

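// Prints every resolved runtime dependency together with the path it is loaded from.
// Note: newer Gradle versions replaced the 'runtime' configuration with 'runtimeClasspath'.
task printDeps {
  doLast {
    println "Dependencies:"
    configurations.runtime.each { println it }
  }
}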

Adding the above task to the build file and executing it shows all dependencies and the path to each of them. In my case, the path pointed to the local Gradle cache for most of the dependencies. For the artifacts located in the file-based repository, however, the path pointed directly to the file server instead of the local cache.

Digging deeper revealed that Gradle considers all file-based repositories to be local, and local repositories are not considered worth caching. So the dependency is used directly from that location, even if it points to a server. As mentioned at gradle.org, this is hard-coded into Gradle. There were also some feature requests to make this behaviour configurable for Ivy and Maven repositories. Sadly, they did not survive the migration to GitHub.

Finding our way out

This behaviour is only hard-coded for file repositories. So one solution could be to switch to a binary repository manager like Artifactory or Nexus. However, this has the drawback of having to maintain another server, which in our case provides little added value compared to the file server solution.

Another solution is to download and cache the dependencies manually in the build script. This can be done by adding a dedicated task to the build script, which copies the file-based repository into a local folder. This always copies the whole repository, which can increase build time and network load. One could add some caching logic, but that would just reinvent the wheel.

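// Mirrors the file-based repository into a local folder before compiling.
task syncDependencies(type: Sync) {
  group = 'build setup'
  // Both properties are project-specific and must be defined elsewhere in the build.
  from project.ext["mobitopp.repository.url"]
  into project.ext["local.cache.path"] as File
}
compileJava.dependsOn syncDependencies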

In our case, the build time did not increase significantly, and compared to the maintenance costs of another server, this solution is easier for us to handle.

Conclusion

Watch out where your build tool loads its artifacts from. Make sure that your builds do not affect each other.

Repeatability in software development

Compared to other engineering disciplines, software development has a great advantage in testability. We can automatically test our whole product within a short period of time and after every change we make. Comparing this to quality testing in, for example, mechanical engineering reveals that we can save a lot of time and test more often, even during development. This gives us a much faster quality assurance than other disciplines have.

Repeatability in tests

To gain this performance we have to write tests with certain properties. Andy Hunt and David Thomas, and in the newer edition also Jeff Langr, describe the A-TRIP and FIRST properties of tests in their book Pragmatic Unit Testing. Both sets are comparable and both contain repeatability, which means that a test provides the same result on every run. This property is also required in simulations.

Repeatability in simulations

Given the same inputs and the same version, a simulation must produce the same output. In fields where simulations have to cover a certain amount of uncertainty, like traffic simulation, randomness is introduced to model human decision making. Such simulations are designed as a kind of Monte Carlo experiment.

As true randomness is not repeatable, pseudo-randomness is used. This means that a random number generator with a specified seed is used to make experiments reproducible. As long as the seed stays the same, the simulation should produce the same output. Once the seed is changed, the simulation might produce a different output.
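A minimal sketch of this idea in Java, using java.util.Random. The class name SeededRandomDemo, the method draw and the seed values are made up for illustration; they are not taken from our simulation code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical demo class, not part of the simulation described in this post.
final class SeededRandomDemo {

  // Draws 'count' pseudo-random integers in [0, 100) from a generator created with 'seed'.
  static List<Integer> draw(long seed, int count) {
    Random random = new Random(seed);
    List<Integer> values = new ArrayList<>();
    for (int i = 0; i < count; i++) {
      values.add(random.nextInt(100));
    }
    return values;
  }

  public static void main(String[] args) {
    // Two generators with the same seed yield the same sequence.
    System.out.println(draw(42L, 5).equals(draw(42L, 5))); // prints "true"
    // A different seed may yield a different sequence.
    System.out.println(draw(42L, 5));
    System.out.println(draw(43L, 5));
  }
}
```

In a simulation, the seed would typically be part of the input configuration, so that it is stored together with the results of an experiment.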

Using controlled random number generators is one aspect of reproducing the results of earlier experiments. Another aspect is avoiding data structures that store data in an uncontrolled way, like HashMap. A HashMap might change the order of the stored objects during a rehash. Because of this, the iteration order might differ at different times during the execution of the program. This is also mentioned in the JavaDoc comment.

This class makes no guarantees as to the order of the map; in particular, it does not guarantee that the order will remain constant over time.

On the other hand, HashMap relies on the hashCode method of Object to distribute the objects in its internal data structure. As mentioned in the JavaDoc of hashCode, the hash code of an object need not be the same in different executions of the same application.

This integer need not remain consistent from one execution of an application to another execution of the same application.

The first aspect might not corrupt repeatability as long as the elements are added to the map in the same way and the rehashing does not change between executions. The second aspect might corrupt repeatability, but it is only relevant when the application iterates over the map. If only the lookup mechanism of the map is used, HashMap is just fine.

Alternatives

There are several alternatives which provide a repeatable iteration order. When the keys have a natural order, one can use TreeMap, which implements SortedMap. The entries are sorted by their keys, using either the compareTo method of keys implementing Comparable or a given Comparator. As long as the comparison stays the same, the results will be repeatable.

If the elements have no natural order and no order can be defined, one can use a LinkedHashMap. LinkedHashMap does not rely on comparable objects, but iterates over its entries in the order in which they were added. This results in repeatable simulation experiments as long as the input data is added in the same order.
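The difference between the two alternatives can be sketched as follows. The class IterationOrderDemo and the keys "zone-a" to "zone-c" are invented for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class; keys and values are made up for illustration.
final class IterationOrderDemo {

  // Fills the given map in a fixed, unsorted insertion order and
  // returns the keys in the order the map iterates over them.
  static List<String> keysInIterationOrder(Map<String, Integer> map) {
    map.put("zone-c", 3);
    map.put("zone-a", 1);
    map.put("zone-b", 2);
    return new ArrayList<>(map.keySet());
  }

  public static void main(String[] args) {
    // TreeMap iterates in the natural order of the keys.
    System.out.println(keysInIterationOrder(new TreeMap<>()));
    // prints [zone-a, zone-b, zone-c]

    // LinkedHashMap iterates in insertion order.
    System.out.println(keysInIterationOrder(new LinkedHashMap<>()));
    // prints [zone-c, zone-a, zone-b]
  }
}
```

Both results are the same on every run and on every machine, which is exactly the property a repeatable simulation needs.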

Conclusion

When your application must produce the same output given the same input, think twice about which data structures you use. If you want to iterate over the entries or keys of a Map, use an implementation which provides a repeatable iteration order, like TreeMap or LinkedHashMap. The same applies to tests. Otherwise your tests may pass on your machine, but fail on another one. So be sure to use the right data structure for the right task.

Stumbling over the Liskov Substitution Principle

Today I stumbled over a problem that has been living in Java for a long time: the elements of a Java Stack are not iterated in the order I expected. A Stack in Java is a Vector with additional push, peek and pop methods. As a Vector is a List and a List is an Iterable, the elements of a Stack can be processed using a for-each loop. Inheriting the behaviour from Vector means that all elements are processed in the order they were added to the Stack. But this is not the order in which I expected a Stack to be iterated. The following code shows this behaviour.

Stack<String> stack = new Stack<>();
stack.push("first");
stack.push("second");
// Iterates in insertion order (inherited from Vector), not in LIFO order.
for (String element : stack) {
  System.out.println(element);
}

Output:

first
second

A look around the web, especially Stack Overflow, reveals that I am not the only one asking for a different order when iterating a Stack. The Java bug tracker provides the reason for the current behaviour. The Stack class extends the Vector class, which results in inherited methods and behaviour that cannot be deactivated. Another piece of code shows this a bit more practically.

Stack<String> stack = new Stack<>();
stack.push("first");
stack.push("second");
// Inherited from Vector: inserts in the middle, which a real stack should not allow.
stack.add(1, "third");

for (String element : stack) {
  System.out.println(element);
}

Output:

first
third
second

As mentioned in the bug tracker, this behaviour violates the Liskov Substitution Principle: a Stack does not behave like a Vector, so it should not inherit from Vector. The bug tracker also mentions that this design decision was not a good one. But it has been made and now we have to live with it. Additionally, the JavaDoc comment of the Stack class tells us to prefer the Deque interface and its implementations, which provide a more complete and consistent set of stack operations.

Long story short, using a Deque as a Stack in Java looks like the following.

Deque<String> stack = new LinkedList<>();
stack.push("first");
stack.push("second");

// A Deque used as a stack iterates from the top of the stack downwards.
for (String element : stack) {
  System.out.println(element);
}

Output:

second
first

In the future I will keep an eye on the classes I use, especially whether an implementation fits the concept it implements or not.

Java Forum Stuttgart – Part 3

This is the last post about my visit to the Java Forum Stuttgart. In Part 1 and Part 2 I described the other talks I attended at the JFS 2016. In this post I will present the remaining ones.

Erhöhe i um 1

This talk was a replacement for another one whose speaker was not able to attend the conference. Michael Wiedeking again gave an entertaining talk about comments in code, especially comments like i = i + 1; // Increase i by 1. He also discussed the difference between API documentation, like JavaDoc, and normal inline comments. His conclusion was that instead of writing comments, one should invest time in more readable names.

Another interesting part of the talk covered different types of interfaces. He splits interfaces into three types.

  1. unchangeable public interfaces
  2. changeable public interfaces
  3. private interfaces

Type 3 is the least problematic one. This type is only used to encapsulate different parts of our software internally. Changing a type 3 interface is like changing a normal class. It is just a refactoring, because the developer has all usages of the interface checked out. Type 2 interfaces are used in-house or by a small number of users who are known to the developer. A change to this kind of interface is a bit more problematic, but with good reasons it is acceptable, because only a few people have to change their software. Nonetheless it should be avoided. Type 1 interfaces are the most problematic ones, because they are published to a wide audience and used by a lot of developers worldwide. A good example of this is the JDK. Changing type 1 interfaces or their visibility is nearly impossible. Every change to an interface of this type will break a huge number of builds and is therefore not acceptable.

Was jeder Java-Entwickler über Strings wissen sollte

This talk was held in the fashion of What every Java Programmer should know about Floating Point Arithmetic and revealed some interesting insights into Strings in Java. Before presenting those insights, the speaker gave a short introduction to measuring the performance of Java programs. This part was mainly based on blog posts from Antonio Goncalves, the book Java Performance written by Scott Oaks, and Quality Code written by Stephen Vance. Performance in Java is best measured using the Java Microbenchmark Harness (JMH), which is developed as part of the OpenJDK. It allows analyzing programs at scales down to nano- and microseconds and provides support for warming up the JIT compiler.

After this introduction to measuring performance in Java, the presenter showed the impact of String#intern. This method adds the content of a String to the JVM's string table and returns a reference to the single shared copy. As a result, two interned Strings with the same content need the memory for the content only once, plus the memory for the two references. Depending on the application, this can reduce the memory footprint significantly. If you want to analyze this, you can use -XX:+PrintStringTableStatistics as a command line argument. When the G1 garbage collector is used (-XX:+UseG1GC), String deduplication can additionally be activated with -XX:+UseStringDeduplication.
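The effect of String#intern on references can be illustrated with a small sketch. This is not code from the talk; the class StringInternDemo and its helper methods are made up for illustration, and new String(...) is only used to force two distinct objects with equal content.

```java
// Hypothetical demo class, not taken from the talk.
final class StringInternDemo {

  // True if both variables refer to the very same String object.
  static boolean sameReference(String a, String b) {
    return a == b;
  }

  // True if both Strings map to the same entry in the string table.
  static boolean sameReferenceAfterIntern(String a, String b) {
    return a.intern() == b.intern();
  }

  public static void main(String[] args) {
    // Two distinct objects with equal content.
    String first = new String("simulation");
    String second = new String("simulation");

    System.out.println(first.equals(second));                    // prints "true"
    System.out.println(sameReference(first, second));            // prints "false"
    System.out.println(sameReferenceAfterIntern(first, second)); // prints "true"
  }
}
```

Whether interning pays off depends on how many duplicate Strings the application keeps alive, so it is worth measuring before and after.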

This and that

Between the talks and on the way to and from the Java Forum I had a lot of other interesting conversations. All in all it was a nice experience and I will reserve the date for the next Java Forum in my calendar.

Java Forum Stuttgart – Part 2

A long time after my first post, I finally found the time to write another one. In Java Forum Stuttgart – Part 1 I described the first talks I attended at the JFS 2016. In this post I will present some more impressions of the JFS.

HomeKit, Weave oder Eclipse SmartHome?

The third talk I listened to compared several smart home frameworks and their chances in the future. As a starter, Apple HomeKit and Google Weave were presented. Both are designed as closed systems. Every vendor who would like to integrate its devices into one of those systems has to support the protocol of that particular system. But because there is a big zoo of protocols out there, the presenters expect that none of these systems will be the single home automation solution of the future.

After this five-minute introduction to the failure of HomeKit and Weave, Eclipse SmartHome (ESH) was presented as the system which could succeed in becoming the single solution. The presenters base this assumption on the basic design of ESH. ESH is not a single solution. It is more like a framework into which every vendor can integrate its devices. Developers, on the other hand, can access the devices in a unified way and build their solutions on top of ESH. One of these solutions is OpenHAB. It is the predecessor of ESH and is now built on top of it.

Über den Umgang mit Lambdas

The last talk before the lunch break was given by Michael Wiedeking. I had heard him speak at Herbst Campus in 2012 and was excited about his talk. His talks are usually very informative and entertaining at the same time. After an introduction to what was needed to implement method references and lambdas in Java 8, he talked about their usage, especially how useful lambdas and method references are. At the end of his talk he presented a first solution for handling checked exceptions inside streams.
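I do not have the code from the talk, so the following is only a sketch of one common approach and not necessarily the solution the speaker presented: a small wrapper that turns a function throwing a checked exception into a java.util.function.Function. The names CheckedLambdaDemo, ThrowingFunction, unchecked and parsePositive are all made up for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical demo class; one possible approach, not the speaker's code.
final class CheckedLambdaDemo {

  // Like Function, but allowed to throw checked exceptions.
  @FunctionalInterface
  interface ThrowingFunction<T, R> {
    R apply(T input) throws Exception;
  }

  // Wraps a throwing function so it can be passed to Stream.map.
  // Checked exceptions are rethrown wrapped in a RuntimeException.
  static <T, R> Function<T, R> unchecked(ThrowingFunction<T, R> function) {
    return input -> {
      try {
        return function.apply(input);
      } catch (RuntimeException e) {
        throw e;
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    };
  }

  // Made-up operation that declares a checked exception.
  static int parsePositive(String text) throws Exception {
    int value = Integer.parseInt(text);
    if (value < 0) {
      throw new Exception("negative value: " + text);
    }
    return value;
  }

  static List<Integer> parseAll(List<String> texts) {
    return texts.stream()
        .map(unchecked(CheckedLambdaDemo::parsePositive))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(parseAll(Arrays.asList("1", "2", "3"))); // prints [1, 2, 3]
  }
}
```

The drawback of this approach is that the caller no longer sees the checked exception in the method signature and has to unwrap the cause of the RuntimeException.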

Top Performance Bottleneck Patterns Deep Dive

This talk was another entertaining and informative one at JFS. Andreas Grabner gave a short introduction to DevOps and to how Otto, a big German retailer, improved its performance and the time to market of new features.

In the rest of the talk he showed simple metrics for measuring performance in production and the common problems he often finds. One of those metrics is a click heatmap to measure user experience. This map shows how often users click on areas of your page, which can be an indicator of the responsiveness of your web page.

Afterwards he presented some widespread performance problems. First and second place are reserved for bad database handling, like not using prepared statements to reduce parsing overhead. In third place you can find bad code, especially bad control flow management. Exceptions are basically a good idea, but they can reduce performance when used in the wrong way. You can find out more about this topic on his blog.

That is enough for today. The last talks of the Java Forum Stuttgart will hopefully be published soon.

Java Forum Stuttgart – Part 1

Some days ago I attended the Java Forum Stuttgart. After Herbst Campus in 2012, it was my second commercial conference. So I am still new to such conferences, but so far I like the format of these regional conferences: big enough to meet new people.

As you can see in the program of the conference, a lot of interesting talks were given. Here is a short overview of the talks I attended.

  1. Eclipse on Steroids – Boost your Eclipse and Workspace Setup given by Frederic Ebelshäuser from Yatta Solutions GmbH
  2. Spark vs. Flink – Rumble in the (Big Data) Jungle given by Michael Pisula and Konstantin Knauf from TNG Technology Consulting GmbH
  3. HomeKit, Weave oder Eclipse SmartHome? Best Practices für erfolgreiche Smart-Home-Projekte given by Thomas Eichstädt-Engelen and Sebastian Janzen from neusta next GmbH & Co. KG and innoQ Deutschland GmbH
  4. Über den Umgang mit Lambdas given by Michael Wiedeking from MATHEMA Software GmbH
  5. Top Performance Bottleneck Patterns Deep Dive given by Andreas Grabner from Dynatrace
  6. Erhöhe i um 1 given by Michael Wiedeking from MATHEMA Software GmbH
  7. Was jeder Java-Entwickler über Strings wissen sollte given by Bernd Müller from Ostfalia Hochschule für angewandte Wissenschaften

Eclipse on Steroids

This talk covered the new Eclipse profiles developed by Yatta. Eclipse profiles give you the ability to share your Eclipse configuration between several computers or team members. To this end, all necessary information about your current Eclipse configuration is saved. This includes installed plug-ins, settings, repository paths, checked out projects and working sets. The contents of your repository remain untouched. Yatta only saves the paths. The same applies to plug-ins which are only available locally.

The profiles can be shared via yatta.de, where you can also restrict the visibility of your profiles. You can make a profile visible to everyone, to just a group of people or only to yourself. To install a shared profile, you can download the yatta-launcher. You only need to select the profile and specify a location for Eclipse and the workspace, and the launcher will do the rest. Every plug-in is installed automatically. After the first start, the launcher configures the repositories and checks out the code. This may take a while, but after it is finished, your workspace looks as close to the saved one as possible.

There are some other nice features, like caching Eclipse and plug-in downloads. But the feature I am missing most from Eclipse in this context is not yet supported by Yatta either. There is no (known) possibility to upgrade your Eclipse major version with a single click. After every major update, you have to install all plug-ins again. As mentioned, Yatta does not support this, but the speaker was interested in the topic. So maybe some day we can use it.

Spark vs. Flink

As the title suggests, this talk compared the two BigData frameworks Spark and Flink. They were compared by their abilities in batch and stream processing, but the main part targeted streaming. This is also the area where the two frameworks diverge the most. Flink is written as a pure streaming framework, whereas Spark is based on batch processing and therefore only supports micro-batch processing for streams.

Flink is basically written in Java and Spark is written in Scala. For Java developers, this means that the Flink API feels more natural than the Spark one. The Java Spark API looks more like a Java wrapper around the Scala API. This goes hand in hand with the fact that new features are available in the Scala API first.

Compared to MapReduce or Storm, both APIs provide a higher level of abstraction. This was not part of the talk, but the following table compares some BigData frameworks by their level of abstraction.

             Batch        Streaming
high level   Pig          Spark, Flink
low level    MapReduce    Storm

When both speakers were asked which framework they would use, the answer was, as always: it depends! If you have a lot of batch work and only a small part of streaming data, Spark is the framework of choice. The integration between batch and streaming is a bit better in Spark. If it is the other way round and you have a lot of streaming data, they recommend Flink. They used Flink in their last project and it did the job quite well. It should also be mentioned here that Google Cloud Dataflow provides support for Flink. Cloud Dataflow is a replacement for MapReduce at Google.

That is enough for today. The next part about the Java Forum Stuttgart will be published in a few days.