1/08/2012

Data, Functions and Objects

Assassinations by Stateless on Grooveshark

One thing I realized while learning Clojure is how data, functions and objects relate to each other.

Data is a structured set of values or, said another way, values within a data structure (e.g. a list of integers, a map of [key, value] pairs, a struct).

A function takes data as input and produces data as output.

An object is a combination of data and functions within the same entity.

The problem with object-oriented languages like Java is that programmers tend to forget about data and functions and think only in terms of objects. This is problematic because data and functions are simpler than objects: not only simpler to write, but also simpler to test and maintain. Objects are great for modeling because this type of entity is closer to the way humans see the world. But objects bring complexity because of the state that comes with them. Stateful objects can be painful to test and difficult to maintain. When you add multi-threading to the mix, it can become a real nightmare.

So, in order to keep the code simple, the programmer should favor data and functions over stateful objects.

Let's look at an example. Here we have an Employee class defined the typical Java way. Data (the employee's information) and functions (processing) are coupled together in a class. Some methods support the data (getters/setters) and the other methods are the functions that can be applied to employees.

Listing 1
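A minimal sketch of such a class, for illustration only. Only targetBonus, computeBonus() and isEligibleForBenefits() are named in the text; the other fields, the factor argument and both formulas are assumptions.

```java
// Hypothetical mutable Employee: data and functions live in the same class.
class Employee {
    private String name;
    private double salary;
    private double targetBonus;   // assumed to be a fraction of salary, e.g. 0.10
    private int yearsOfService;   // assumed field used by the HR rule

    Employee(String name, double salary, double targetBonus, int yearsOfService) {
        this.name = name;
        this.salary = salary;
        this.targetBonus = targetBonus;
        this.yearsOfService = yearsOfService;
    }

    // Methods that support the data
    String getName() { return name; }
    void setName(String name) { this.name = name; }
    double getSalary() { return salary; }
    void setSalary(double salary) { this.salary = salary; }
    double getTargetBonus() { return targetBonus; }
    void setTargetBonus(double targetBonus) { this.targetBonus = targetBonus; }
    int getYearsOfService() { return yearsOfService; }
    void setYearsOfService(int yearsOfService) { this.yearsOfService = yearsOfService; }

    // Salary-related function (assumed formula)
    double computeBonus(double factor) {
        return salary * targetBonus * factor;
    }

    // HR-related function (assumed rule)
    boolean isEligibleForBenefits() {
        return yearsOfService >= 1;
    }
}
```
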
As I mentioned in a previous post, a better approach would be to make the employee's data immutable to avoid any issues with multi-threading. For example, if computeBonus() is called and targetBonus is changed by the setter while it executes, the result can be inconsistent and wrong. So a better version of this class would be immutable.

Listing 2
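A sketch of what an immutable version might look like, under the same assumptions as before (hypothetical fields and formulas; only targetBonus and the two method names come from the text):

```java
// Hypothetical immutable Employee: all fields are final and there are no setters.
final class Employee {
    private final String name;
    private final double salary;
    private final double targetBonus;   // assumed to be a fraction of salary
    private final int yearsOfService;   // assumed field used by the HR rule

    Employee(String name, double salary, double targetBonus, int yearsOfService) {
        this.name = name;
        this.salary = salary;
        this.targetBonus = targetBonus;
        this.yearsOfService = yearsOfService;
    }

    String getName() { return name; }
    double getSalary() { return salary; }
    double getTargetBonus() { return targetBonus; }
    int getYearsOfService() { return yearsOfService; }

    // targetBonus cannot change while this executes, so the result is consistent.
    double computeBonus(double factor) {
        return salary * targetBonus * factor;
    }

    boolean isEligibleForBenefits() {
        return yearsOfService >= 1;
    }
}
```
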
Now, if we take a closer look at the employee's methods, we have salary-related methods and an HR-related method (isEligibleForBenefits()). As the system grows, the Employee class will probably grow too, and more and more unrelated methods will be added to it. So what if we look at it from the functional programming perspective? Let's split the data from the functions.

Listing 3
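A sketch of the data-only class, assuming the same hypothetical fields as above:

```java
// Hypothetical data-only Employee: public final fields, no behavior.
final class Employee {
    public final String name;
    public final double salary;
    public final double targetBonus;   // assumed to be a fraction of salary
    public final int yearsOfService;   // assumed field

    Employee(String name, double salary, double targetBonus, int yearsOfService) {
        this.name = name;
        this.salary = salary;
        this.targetBonus = targetBonus;
        this.yearsOfService = yearsOfService;
    }
}
```

"Modifying" such an employee means building a new one, for example `new Employee(e.name, e.salary, 0.20, e.yearsOfService)`.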
I'm not sure yet what the best way to represent data in Java is, but using public final fields seems nice since the data is immutable and the code is not cluttered with getters. Now if the data has to be modified, either a new Employee is constructed or an EmployeeBuilder can be written.

To write the functions, there are many alternatives. Let's try with static methods.

Listing 4
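A sketch of the static-method option. The class name EmployeeFunctions, the trimmed-down Employee fields and the formulas are assumptions made for illustration:

```java
// Hypothetical data-only Employee (fields are assumptions).
final class Employee {
    public final double salary;
    public final double targetBonus;
    public final int yearsOfService;

    Employee(double salary, double targetBonus, int yearsOfService) {
        this.salary = salary;
        this.targetBonus = targetBonus;
        this.yearsOfService = yearsOfService;
    }
}

// Functions as static methods: easy to call, but callers are bound
// to this exact implementation at compile time.
final class EmployeeFunctions {
    private EmployeeFunctions() {}   // not meant to be instantiated

    static double computeBonus(Employee employee, double factor) {
        return employee.salary * employee.targetBonus * factor;
    }

    static boolean isEligibleForBenefits(Employee employee) {
        return employee.yearsOfService >= 1;
    }
}
```
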
Static methods are easy to use, but they have one drawback: the client code is strongly coupled with the method. This means that the client code will always call the exact same method at runtime. This can cause trouble for the tests. Mock libraries such as PowerMock could be used to mock static methods, but the test code seems odd as the programmer will have to mock a class (the one containing the static methods) that seems unrelated to the test since the dependency is hidden in the class under test. The test code then looks like dark magic to me. Also, if we compare this option with the original Employee version, we have lost the option of polymorphic methods (runtime method dispatch). So I think static methods are only good for utility functions, those which are not likely to change and that can be used in many unrelated contexts, such as Math.round().

Another option is to create a class that wraps the Employee data and adds the functions.

Listing 5
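A sketch of such a wrapper. The name EmployeeSalary and the Employee fields are hypothetical:

```java
// Hypothetical data-only Employee (fields are assumptions).
final class Employee {
    public final double salary;
    public final double targetBonus;

    Employee(double salary, double targetBonus) {
        this.salary = salary;
        this.targetBonus = targetBonus;
    }
}

// Wrapper that holds the data and adds the salary functions to it.
final class EmployeeSalary {
    private final Employee employee;

    EmployeeSalary(Employee employee) {
        this.employee = employee;
    }

    double computeBonus(double factor) {
        return employee.salary * employee.targetBonus * factor;
    }
}
```

The client has to write something like `new EmployeeSalary(employee).computeBonus(1.5)` every time.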
Basically, we are back to square one. This class is equivalent to the original Employee class, but more cumbersome to use. The only small gain I see is that it is possible to group the different employee functions in different classes.

I think a good option is to use an attribute-less class.

Listing 6
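A sketch of the attribute-less class, together with an interface and an injected client to illustrate the polymorphism point. BonusCalculator, SalaryFunctions and Payroll are hypothetical names, and the formula is an assumption:

```java
import java.util.List;

// Hypothetical data-only Employee (fields are assumptions).
final class Employee {
    public final double salary;
    public final double targetBonus;

    Employee(double salary, double targetBonus) {
        this.salary = salary;
        this.targetBonus = targetBonus;
    }
}

// Optional interface: lets tests or other parts of the system swap implementations.
interface BonusCalculator {
    double computeBonus(Employee employee, double factor);
}

// Attribute-less class: no state, only functions.
class SalaryFunctions implements BonusCalculator {
    @Override
    public double computeBonus(Employee employee, double factor) {
        return employee.salary * employee.targetBonus * factor;
    }
}

// Client code receives the functions instead of creating them with new.
class Payroll {
    private final BonusCalculator calculator;

    Payroll(BonusCalculator calculator) {
        this.calculator = calculator;
    }

    double totalBonus(List<Employee> employees, double factor) {
        double total = 0.0;
        for (Employee employee : employees) {
            total += calculator.computeBonus(employee, factor);
        }
        return total;
    }
}
```

In a test, Payroll can be given a stub such as `(employee, factor) -> 1.0` without any mocking library.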
With this option, we have gained something compared to the static functions: polymorphism. We could create an interface for the class and implement the same methods differently. This is great for tests and great for flexibility. We can also group the functions logically (salary functions in one class, HR functions in another, ...). The downside is that we need to instantiate the class in order to use it and inject it into the client code. Otherwise the benefits are lost, because instantiating the class in the client code with the new operator is the same as using static methods: the new operator behaves like a static method that hardcodes the dependency.

We could go one step further...

Listing 7
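A sketch of this step, where factor moves from a method argument to an immutable attribute. The class name BonusFunction, the Employee fields and the formula are assumptions:

```java
// Hypothetical data-only Employee (fields are assumptions).
final class Employee {
    public final double salary;
    public final double targetBonus;

    Employee(double salary, double targetBonus) {
        this.salary = salary;
        this.targetBonus = targetBonus;
    }
}

// The function object is configured once with its parameter (factor)
// and then applied to any number of employees (the variable).
final class BonusFunction {
    private final double factor;

    BonusFunction(double factor) {
        this.factor = factor;
    }

    double computeBonus(Employee employee) {
        return employee.salary * employee.targetBonus * factor;
    }
}
```

Callers only need a BonusFunction instance; they never see the factor value.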
Now factor is an immutable attribute. This is great if, for some parts of the system, the factor value is always the same; the programmer does not have to specify it at every call and, even better, he doesn't have to propagate it to all the callers, which are now independent of this bonus computation detail. Put another way, employee is a variable and factor is a parameter of the function (as x is a variable and A and B are parameters in f(x) = Ax + B).

There are cases where an object (state attributes and methods) is the way to go. But when the intention is to apply functions to data, I think a clean alternative is to split the data from the functions by using a simple data object for the data and stateless objects for the functions.

What do you think?

12/31/2011

Retour

The Low Hum by Moby on Grooveshark

The detour took longer than expected. I left to learn about Clojure and how its constructs can simplify a program, but I ended up taking a different turn and left the OZ -> Nokia -> Synchronica boat after more than 7 years to work for a very interesting Montreal start-up called Wajam.

In the next few weeks, I'll focus on entering the start-up mode and working on my confoo presentation...but I have a few more ideas to write about: single responsibility principle, questioning test driven development, software binary mode and some ideas from my small incursion into functional programming...

Stay tuned for more music driven development!

...and Happy New Year!

10/30/2011

Detour



I'll take a detour. I will pause writing about simple testable code for the month of November. Instead, I will focus on learning and understanding Clojure. Why? Because of these two talks by Rich Hickey, Clojure's creator.

InfoQ: Are We There Yet?
InfoQ: Simple Made Easy

In these talks, Rich explains the limitations of object-oriented programming and how they affect program complexity. In Simple Made Easy, he gives some pointers as to how programs could be simplified by using simpler constructs. His explanations are clear and the arguments are compelling. I strongly suggest watching these two talks; they are well worth the time.

At this point in my quest for simple testable code, I believe I need to understand how Clojure simplifies programs and why its constructs make it so. Hopefully, I will be able to bring some of this back to the Java world to create even simpler testable code :)

10/24/2011

Final, Clean and Simple



I've read Clean Code. This is a great book for any developer who values craftsmanship. Throughout the book, I could not stop thinking about the link between clean code and simple code. They should be the same, shouldn't they? Clean code is simple, simple code is clean. Robert C. Martin's book is clearly about making code clean and easy to read. Could someone who follows all the rules in Clean Code still end up with complex but readable code?

The following passage of Clean Code highlights a potential clash between clean and simple:

"I think that there are a few good use of final, such as the occasional final constant, but otherwise the keyword adds little value and creates a lot of clutter. Perhaps I feel this way because the kinds of errors that final might catch are already caught by the test I write."

To reduce code clutter, programmers should rarely use final. I think this is wrong. It is wrong because the final keyword reduces complexity. Applied to attributes, it limits the state space of the object. It tells the programmer not to worry about this attribute's reference changing over the lifetime of the object. Whenever I see a final attribute, I can prune possible state transitions. I'm all for adding a little clutter to the code to explicitly reduce the program's apparent complexity. Regarding the second statement of the passage, I do not believe final should be used to catch errors. It should be used to limit complexity, to simplify the code. Tests and the use of final are not related.
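As a small illustration of that pruning, here is a hypothetical class (Account and its fields are made up for this sketch) where final marks the part of the state that never needs to be tracked:

```java
// Hypothetical example: only one of the two attributes can ever change.
final class Account {
    private final String id;   // final: this reference is fixed for the object's lifetime
    private long balance;      // not final: the only state transition left to reason about

    Account(String id, long initialBalance) {
        this.id = id;
        this.balance = initialBalance;
    }

    String getId() { return id; }
    long getBalance() { return balance; }

    void deposit(long amount) {
        balance += amount;
        // "id = ..." here would be rejected by the compiler.
    }
}
```

A reader scanning the class sees immediately that only balance deserves attention when reasoning about how an Account evolves.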

More generally, minimization of code and simplification of code are not the same thing. I initially thought that less code is better. I was wrong. After watching this talk by Rich Hickey, I understood a more powerful meaning of simplicity: one concept, one responsibility, one role. Clean Code does a good job of highlighting this desirable code attribute. If you split the code into single purpose, untangled methods, classes and components, the number of elements in the design will increase. You will end up with more, not less. As Rich says in his talk, simplicity is not about counting.

Understanding the real meaning of simplicity (vs easy, vs clean) is important. It is also important to understand what has to be simple. Is it the code syntax and format, or the program structure and the elements that it consists of? I think programmers too often apply simplicity to the code surface, but it is when it is applied to the structure that the long-term gains are impressive.

10/17/2011

About Structure and Value


I've been thinking about the expert vs master dichotomy a bit more these days and my conclusion is that it is all about structure and value.

Expert programmers value structure first, master programmers build value first.

10/10/2011

The expert, the master...a Java Version


One of the inspirations for this blog is this post by Zed Shaw where he points out what he thinks is the difference between an expert and a master programmer.

The expert, the master, the programmer.

For fun, I wrote a Java version.

10/05/2011

Event Driven I/O


The new cool web framework is node.js. It is a JavaScript web framework that uses OS calls to get notified when an I/O operation is ready to be made. This way, a single thread can manage the processing of many network connections. The concept is not new (Java, Python and Ruby all have it)...but for the web app world, it seems to be a revelation.

Earlier this week, there was a debate over some claims that node.js puts on their web site that started with this post by Ted Dziuba. This debate got out of hand as the node community tried to invalidate the arguments, but somehow missed the point completely. So Ted came back with a great post that compares threaded systems (TS) vs evented systems (ES): Straight Talk on Event Loop. He shows that the theoretical throughput (queries per second) of an ES is almost always inferior to that of a TS. It is all well done...but comparing the two paradigms using throughput only is incomplete.

Intuitively, if an ES is high on I/O but low on CPU, then the event thread will be able to process a high number of transactions per second. But a large number of threads can be used to reach the same throughput. Alternatively, if the system is low on I/O but high on CPU, the event thread quickly becomes the bottleneck. Using more than one event thread improves the ES performance, but does not alter the conclusions.

Also, I agree with him that claiming: because nothing blocks, less-than-expert programmers are able to develop fast systems ... makes no sense.

To me, the main objective of an ES is to decouple the number of threads (a bounded resource) from the number of concurrent in-progress transactions a system can handle. This is why ES are said to be more scalable...not because they are faster, but because they allow a massive number of in-progress transactions to be processed with only a few (bounded) threads... the trade-off is less throughput.

In the article, Ted mentions forking to worker threads for processing the request. This reduces the processing load on the event thread which could consequently process more operations. By doing that, the programmer is basically using an ES to drive a TS. So the benefit of using a limited number of threads is gone. But a bounded worker thread-pool could be used. This is probably the best approach, but this does not simplify the system and tweaking the pool size is not a trivial task.
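A rough sketch of the bounded worker pool idea, using java.util.concurrent. This is not a real event loop: the calling thread stands in for the event thread, and the class name, the handle() method and the squaring it performs are all hypothetical placeholders for request processing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BoundedWorkers {
    // Stand-in for CPU-heavy request processing (hypothetical).
    static int handle(int request) {
        return request * request;
    }

    // The calling thread plays the event thread: it only dispatches requests.
    // The processing itself runs on at most poolSize worker threads.
    static List<Integer> processAll(List<Integer> requests, int poolSize) {
        ExecutorService workers = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (int request : requests) {
                pending.add(workers.submit(() -> handle(request)));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : pending) {
                // A real event loop would be notified on completion
                // instead of blocking here; this keeps the sketch short.
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            workers.shutdown();
        }
    }
}
```

The poolSize argument is the knob mentioned above: too small and the workers become the bottleneck, too large and the benefit of a bounded number of threads fades.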

The only case where I think an ES should be considered is when the system is high on I/O, low on CPU, and has to process a very large number of concurrent transactions. I've been using, writing and debugging an event networking system to build scalable protocol gateways for the last few years and, for this type of product, I do not think a TS could do the job.