Code Learnings: 2013

Thursday, November 28, 2013

Making Java Groovy

Manning Publications recently published "Making Java Groovy" by Ken Kouson so I went and bought it. I often buy books about topics I'm interested in and read them on the train to work to find out if the described technologies can help me with my current or future projects. I really like Manning books because they are usually a good mixture between tutorial/reference and just good introduction into technologies.

Working in an almost Java-only environment have been interested in Groovy for quite some time. Most Java developers know the problem: You have all kinds of libraries or at least code snippets that already capture your business logic. But if you want to run a quick script to make use of it, you either have to create a full blown Java app with all it's draw-backs, or go for some other scripting language that can do the trick with as little code as possible; for my quick-and-dirty scripts I usually resorted to one of bash, PHP or Perl. For more complicated tasks I bit the bullet I wrote Java app, bundled into one big jar with a myriad of command line parameters to be able to switch environment, figure out best settings and what not.

Loops in Message Routing/Graphs

This topic seems to be so beat down even to me that it was somewhat surprising to actually come across a real life example where a major corporation seemed to have not paid attention to it. Even more surprising if this happens to a company which's whole business is about routing: USPS.

I recently ordered something online that was supposed to be delivered by USPS, which was tasked to just complete "the last mile". Now I made a mistake and had the package shipped to my office but accidentally had typed in the zip code of my home address (which are only about 7 miles apart from each other). Looking at the online tracking app, I could not believe what is happening:

Don' forget about IOPS on RDS

A lot of things a slightly different when you run your application on cloud services like AWS. Take database servers: Increased load can always change the performance of DB look-ups and writes. Larger tables can lead to slow queries if the tables are not indexed right, a lot of writes may cause unexpected locking, etc.

However if you use Amazon's AWS there is another important factor you should not loose sight of: IOPS. This measures the number of I/O operations between the database server and its storage, which is attached as network storage.

It's somewhat unclear if/how Amazon throttles the throughput if you don't reserve IOPS. You might be at the mercy of other RDS instances that are running on the machine and probably some kind of rate limit AWS implicitly imposes.

Graph Traversal: Breadth First Search

Breadth First Search (BFS) is a fundamental graph-traversal technique and you could use it of find it being used in a wide range of applications. For a full explanation of the technique check this link, or this link.

The parent array in BFS is very useful for finding interesting paths through a graph; the parent relation defines a TREE OF DISCOVERY with the starting node as the root of the tree.

Because vertices are discovered in order of increasing distance from the root, the tree of discovery contains minimum-length paths from the root to each node.

The time complexity of a BFS implementation is: O(M + N) , that is, it runs in linear time as a function of the sum of the cardinalities of the vertex and edge sets.
Two example application problems where BFS plays a role are:

Connected Components
Two-Coloring (Bipartite checking)

Graph Traversals: Generic Implementation

Recently I had to go back and dive deep into the fundamental elements involved in graph traversal algorithms; specifically, I wanted to get refreshingly clear in my head the essence of these and related techniques. I found that the approach taken in the book I always recommend was right on target to allow me to reach my objective. The approach Prof. Skiena proposes is captured somewhat in the implementation we describe in this and the next two posts in this series.

Base Classes

All graph traversals share a certain "philosophy", in the sense that they are trying to achieve a goal (traverse the graph), and to do so they maintain some state (e.g. discovered, time, ...) and use some appropriate data structure explicitly (queue) or implicitly (stack). The idea is then to abstract those commonalities as much as possible and provide "entry" points to the implementation by means of an API (function/method calls), which enable the tailoring of a generic implementation to the solution specific to a given problem.

Graph Traversals

One of the most fundamental graph problem is: traverse all its nodes and edges systematically. For example, mazes are naturally represented as graphs, and therefore finding a path in a maze can easily be solved by executing some graph traversal algorithm.

The key idea behind graph traversals is: MARK every vertex when we FIRST visit it, and keep track of those vertices NOT YET EXPLORED. In such an approach, each vertex in the graph will be in one of three possible states:

UNDISCOVERED
DISCOVERED
PROCESSED

and in a given traversal, the state progression of a vertex would be UNDISCOVERED, then DISCOVERED, and finally PROCESSED.

Thus, at the core of a graph traversal implementation is a data structure to keep track of the work to be done, that is, the vertices that have been discovered but have not been processed yet. During a traversal, when we reach an UNDISCOVERED vertex, we mark it as DISCOVERED and add it to the data structure as "work to be done". If we reach a vertex that is already DISCOVERED/PROCESSED, we ignore it.

If the graph being traversed is DIRECTED, then edge will be traversed once; if the graph is UNDIRECTED, then each edge will be traversed twice.

Two of the most often used graph traversal algorithms are:

Breadth First Search
Depth First Search

In a following post we will describe each of these approaches. Stay tuned.

Saturday, August 3, 2013

Graphs

Graphs rest at the core of much of computer science. Very often, graph modelling is key to solving algorithmic problems. As a Software Engineer, one needs to have a clear understanding of graphs and, most importantly, how to model real-world problems in terms of graphs. It is of paramount importance to be familiar with a wide range of graph problems; more so than understanding the nuances of how a given specific graph algorithm works. For instance, let's say that you work in a company that is generating tremendous amounts of data and you are considering the options for manipulating it and extracting from it business intelligence related to your application; one option you have is to use a Graph Database. If you go that path, the underlying implementation if the graph database you select will contain a wide range of graph algorithms, which will be used to answer the queries you make against the DB. Being aware how some of the nuances related to graph representations and algorithms will help you decide what is the best way to formulate questions and ask them to the graph DB being used.

Again, for an excellent coverage of graph algorithms check Prof. Skiena's book.

In this and some of our future posts, we will elaborate on graphs in general and some graph algorithmic approaches. The material has been collected and/or summarised from a variety of sources, such as the Internet, books, etc.

Nested Properties in Camel Routes

Apache Camel does not support Spring's nested properties in its configuration. I believe this feature is coming up, but at the time of writing this it's neither the case in the version we're running (2.9.1) nor in the current version (2.10.6 seemed totally broken, 2.10.5 also doesn't support it). This is generally unfortunate but in our case we're particularly missing this feature for the autoStartup property of a route.

We usually have groups of routes that are either started together or not at all. However, for e.g. debugging it would be nice to turn the whole group off and specific routes in the group on. This can be achieved relatively easily with a simple helper class that gets properties injected by Spring, hence supporting nested properties.

Dynamic Programming

Algorithms for optimisation require proof that always return the optimal solution. Greedy algorithms are typically efficient but they do not guarantee optimality. On the other hand, exhaustive-search algorithms do guarantee optimality but are usually very expensive to run. Dynamic Programming (DP) combines the best of both worlds. For a very good coverage of DP, check the wonderful book by Prof. Skiena.

The basic idea behind DP is to search for all possible solutions (correctness) while storing computed results along the way to avoid duplicate work (efficiency). Thus, in a nutshell, DP is a technique for efficiently implementing a recursive algorithm by storing partial results.

What is the TRICK behind DP? to determine whether the recursive algorithm we are analysing computes the same sub-problems over and over again. If so, storing the answers to this sub-problems in a table can lead to a very efficient algorithm. Once this fact has been determined, proceed by applying the following three-step approach:

START with a working recursive algorithm
IDENTIFY repeated computations
DERIVE the DP-based solution

In this post, we show an example of this approach, by means of the derivation of a DP-based implementation of the Fibonacci function.

Problem Storing Spring-Batch Job Parameters

We have been using Spring-Batch at inPowered for a while for relatively simple jobs with only a date parameter; MySQL is the backing store for meta data. When we added a job with a parameter of type string the job launcher had problems even getting the job started and the Spring-Batch UI displayed this error:

org.springframework.web.util.NestedServletException: Request processing failed; nested exception is org.springframework.dao.DataIntegrityViolationException: PreparedStatementCallback; SQL [INSERT into BATCH_JOB_PARAMS(JOB_INSTANCE_ID, KEY_NAME, TYPE_CD, STRING_VAL, DATE_VAL, LONG_VAL, DOUBLE_VAL) values (?, ?, ?, ?, ?, ?, ?)]; Data truncation: Incorrect datetime value: '1970-01-01 00:00:00' for column 'DATE_VAL' at row 1; nested exception is com.mysql.jdbc.MysqlDataTruncation: Data truncation: Incorrect datetime value: '1970-01-01 00:00:00' for column 'DATE_VAL' at row 1

Auto-Scaling with AWS Spot Instances

While working on inPowered's back-end systems we are using a lot of different AWS services. One of our sub-components recently needed to be able to handle many times the load it had been handling very well for over a year. Since we're using auto-scaling groups (ASGs) for pretty much everything of course the solution was to simply scale up more instances. However we had to go from a few dozen to a few hundred instances almost over night. And even though this ASG was utilizing t1.micros controlling cost became a concern. So we decided to try using spot instances.

Code Learnings in a Nutshell

A software engineer has the main objective of solving problems from a wide variety of domains. One of the main principles used in Computer Science and Software engineering is Reusability. As defined in Wikipedia, reusability is the likelihood that a segment of source code can be used again to add new functionalities with slight or no modification. Reusable modules and classes reduce implementation time, increase the likelihood that prior testing and use has eliminated bugs and localizes code modifications when a change in implementation is required. In this blog we aim at capturing code learning and solution patterns we have learned and applied in our careers.

Code Learnings

Pages

Thursday, November 28, 2013

Making Java Groovy

Tuesday, August 27, 2013

Loops in Message Routing/Graphs

Wednesday, August 7, 2013