1/08/2012
Data, Functions and Objects
One thing I realized while learning Clojure is how data, functions and objects relate to each other.
Data is a structured set of values or, said another way, values within a data structure (e.g. a list of integers, a map of [key, value] pairs, a struct).
A function takes data as input and produces data as output.
An object is a combination of data and functions within the same entity.
The problem with object-oriented languages like Java is that programmers tend to forget about data and functions and think only in terms of objects. This is problematic because data and functions are simpler than objects: simpler to write, and also simpler to test and maintain. Objects are great for modeling because this type of entity is closer to the way humans see the world. But objects bring complexity because of the notion of state that comes with them. Stateful objects can be painful to test and difficult to maintain. When you add multi-threading to the mix, it can become a real nightmare.
So, in order to keep the code simple, the programmer should favor data and functions over stateful objects.
Let's look at an example. Here we have an Employee class defined the typical Java way. Data (the employee's information) and functions (processing) are coupled together in a class. Some methods support the data (getters/setters) and the other methods are the functions that can be applied to employees.
Listing 1
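A minimal sketch of what such a class might look like. Only computeBonus(), isEligibleForBenefits() and targetBonus are named in the text; the other fields, the formulas and the one-year threshold are assumptions made for illustration.

```java
// Hypothetical "typical Java" Employee: mutable data and functions in one class.
class Employee {
    private String name;
    private double salary;
    private double targetBonus;   // assumed to be a fraction of salary, e.g. 0.10
    private int yearsOfService;   // assumed field, used only for the HR check

    Employee(String name, double salary, double targetBonus, int yearsOfService) {
        this.name = name;
        this.salary = salary;
        this.targetBonus = targetBonus;
        this.yearsOfService = yearsOfService;
    }

    // Methods that support the data (getters/setters).
    String getName() { return name; }
    void setName(String name) { this.name = name; }
    double getSalary() { return salary; }
    void setSalary(double salary) { this.salary = salary; }
    double getTargetBonus() { return targetBonus; }
    void setTargetBonus(double targetBonus) { this.targetBonus = targetBonus; }
    int getYearsOfService() { return yearsOfService; }
    void setYearsOfService(int yearsOfService) { this.yearsOfService = yearsOfService; }

    // Salary-related function (assumed formula).
    double computeBonus() { return salary * targetBonus; }

    // HR-related function (assumed rule: at least one year of service).
    boolean isEligibleForBenefits() { return yearsOfService >= 1; }
}
```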
As I mentioned in a previous post, a better approach would be to have the employee's data immutable to avoid any issues with multi-threading. For example, if computeBonus() is called and targetBonus is changed by the setter while it executes, the result is incoherent and wrong. So a better version of this class would be immutable.
Listing 2
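A sketch of the immutable variant, under the same assumptions as before (field names other than targetBonus and the formulas are illustrative). The withTargetBonus() helper is hypothetical; it only shows that a "change" produces a new instance instead of mutating the existing one.

```java
// Hypothetical immutable Employee: all fields are final and there are no setters.
final class Employee {
    private final String name;
    private final double salary;
    private final double targetBonus;
    private final int yearsOfService;

    Employee(String name, double salary, double targetBonus, int yearsOfService) {
        this.name = name;
        this.salary = salary;
        this.targetBonus = targetBonus;
        this.yearsOfService = yearsOfService;
    }

    String getName() { return name; }
    double getSalary() { return salary; }
    double getTargetBonus() { return targetBonus; }
    int getYearsOfService() { return yearsOfService; }

    // targetBonus cannot change while this runs, so the result is always coherent.
    double computeBonus() { return salary * targetBonus; }

    boolean isEligibleForBenefits() { return yearsOfService >= 1; }

    // Hypothetical helper: returns a modified copy and leaves this instance untouched.
    Employee withTargetBonus(double newTargetBonus) {
        return new Employee(name, salary, newTargetBonus, yearsOfService);
    }
}
```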
Now, if we take a closer look at the employee's methods, we have salary-related methods and an HR-related method (isEligibleForBenefits()). As the system grows, the Employee class will probably grow too, and more and more unrelated methods will be added to it. So what if we look at it from the functional programming perspective? Let's split the data from the functions.
Listing 3
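A sketch of a data-only Employee, assuming the same illustrative fields as in the earlier versions. It holds values and nothing else.

```java
// Hypothetical data-only Employee: immutable values exposed as public final fields.
final class Employee {
    public final String name;
    public final double salary;
    public final double targetBonus;   // assumed to be a fraction of salary
    public final int yearsOfService;   // assumed field

    Employee(String name, double salary, double targetBonus, int yearsOfService) {
        this.name = name;
        this.salary = salary;
        this.targetBonus = targetBonus;
        this.yearsOfService = yearsOfService;
    }
}
```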
I'm not sure yet what the best way to represent data in Java is, but using public final fields seems nice since the data is immutable and the code is not cluttered with getters. If the data has to be modified, either a new Employee is constructed or an EmployeeBuilder can be written.
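One possible shape for such an EmployeeBuilder, as a sketch only. The from() factory and the fluent method names are assumptions, as are the Employee fields.

```java
// Hypothetical data-only Employee (same assumed fields as elsewhere).
final class Employee {
    public final String name;
    public final double salary;
    public final double targetBonus;
    public final int yearsOfService;

    Employee(String name, double salary, double targetBonus, int yearsOfService) {
        this.name = name;
        this.salary = salary;
        this.targetBonus = targetBonus;
        this.yearsOfService = yearsOfService;
    }
}

// Hypothetical builder: copies an existing Employee, overrides some values,
// and builds a new immutable instance.
final class EmployeeBuilder {
    private String name;
    private double salary;
    private double targetBonus;
    private int yearsOfService;

    static EmployeeBuilder from(Employee e) {
        EmployeeBuilder b = new EmployeeBuilder();
        b.name = e.name;
        b.salary = e.salary;
        b.targetBonus = e.targetBonus;
        b.yearsOfService = e.yearsOfService;
        return b;
    }

    EmployeeBuilder name(String name) { this.name = name; return this; }
    EmployeeBuilder salary(double salary) { this.salary = salary; return this; }
    EmployeeBuilder targetBonus(double targetBonus) { this.targetBonus = targetBonus; return this; }
    EmployeeBuilder yearsOfService(int years) { this.yearsOfService = years; return this; }

    Employee build() { return new Employee(name, salary, targetBonus, yearsOfService); }
}
```

With this sketch, a raise would read as EmployeeBuilder.from(employee).salary(55000.0).build(), and the original employee stays unchanged.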
To write the functions, there are many alternatives. Let's try with static methods.
Listing 4
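A sketch of the static-method option. SalaryFunctions is the name used later in the discussion; HrFunctions, the Employee fields and the formulas are assumptions.

```java
// Hypothetical data-only Employee.
final class Employee {
    public final String name;
    public final double salary;
    public final double targetBonus;
    public final int yearsOfService;

    Employee(String name, double salary, double targetBonus, int yearsOfService) {
        this.name = name;
        this.salary = salary;
        this.targetBonus = targetBonus;
        this.yearsOfService = yearsOfService;
    }
}

// Salary-related functions as static methods (assumed formula).
final class SalaryFunctions {
    private SalaryFunctions() {}   // not meant to be instantiated

    static double computeBonus(Employee e) {
        return e.salary * e.targetBonus;
    }
}

// HR-related functions as static methods (hypothetical class name and rule).
final class HrFunctions {
    private HrFunctions() {}

    static boolean isEligibleForBenefits(Employee e) {
        return e.yearsOfService >= 1;
    }
}
```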
Static methods are easy to use, but they have one drawback: the client code is strongly coupled to the method. This means that the client code will always call the exact same method at runtime. This can cause trouble for the tests. Mock libraries such as PowerMock could be used to mock static methods, but the test code seems odd: the programmer has to mock a class (the one containing the static methods) that seems unrelated to the test, since the dependency is hidden in the class under test. The test code then looks like dark magic to me. Also, if we compare this option with the original Employee version, we have lost the option of polymorphic methods (runtime method dispatch). So I think static methods are only good for utility functions, those which are not likely to change and that can be used in many unrelated contexts, such as Math.round().
Another option is to create a class that wraps the Employee data and adds the functions.
Listing 5
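A sketch of the wrapper option. The class name EmployeeSalary, the Employee fields and the formula are assumptions.

```java
// Hypothetical data-only Employee.
final class Employee {
    public final String name;
    public final double salary;
    public final double targetBonus;
    public final int yearsOfService;

    Employee(String name, double salary, double targetBonus, int yearsOfService) {
        this.name = name;
        this.salary = salary;
        this.targetBonus = targetBonus;
        this.yearsOfService = yearsOfService;
    }
}

// Hypothetical wrapper: holds one Employee and adds the salary functions to it.
final class EmployeeSalary {
    private final Employee employee;

    EmployeeSalary(Employee employee) {
        this.employee = employee;
    }

    double computeBonus() {
        return employee.salary * employee.targetBonus;
    }
}
```

The client has to wrap each employee before calling anything, as in new EmployeeSalary(employee).computeBonus().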
Basically, we are back to square one. This class is equivalent to the original Employee class, but more cumbersome to use. The only small gain I see is that it is possible to group the different employee functions in different classes.
I think a good option is to use an attribute-less class.
Listing 6
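A sketch of an attribute-less class: the same functions as in the static version, but as instance methods on a class with no fields. The Employee fields and the formula are assumptions.

```java
// Hypothetical data-only Employee.
final class Employee {
    public final String name;
    public final double salary;
    public final double targetBonus;
    public final int yearsOfService;

    Employee(String name, double salary, double targetBonus, int yearsOfService) {
        this.name = name;
        this.salary = salary;
        this.targetBonus = targetBonus;
        this.yearsOfService = yearsOfService;
    }
}

// Stateless function object: no attributes, only instance methods.
// Not final, so a subclass (or an extracted interface) can vary the behavior.
class SalaryFunctions {
    double computeBonus(Employee e) {
        return e.salary * e.targetBonus;
    }
}
```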
With this option, we have gained something compared to the static functions: polymorphism. We could create an interface for the class and implement the same methods differently. This is great for tests and great for flexibility. We can also group the functions logically (salary functions in one class, HR functions in another, ...). The downside is that we need to instantiate the class in order to use it and inject it into the client code. Otherwise the benefits are lost, because instantiating the class in the client code with the new operator is the same as using static methods: the new operator behaves like a static method that hardcodes the dependency.
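To make the injection point concrete, here is a sketch with an extracted interface and a client that receives the functions through its constructor. DefaultSalaryFunctions, PayrollService and totalCompensation() are hypothetical names, and the formulas are assumptions.

```java
// Hypothetical data-only Employee.
final class Employee {
    public final String name;
    public final double salary;
    public final double targetBonus;

    Employee(String name, double salary, double targetBonus) {
        this.name = name;
        this.salary = salary;
        this.targetBonus = targetBonus;
    }
}

// Interface extracted from the attribute-less class.
interface SalaryFunctions {
    double computeBonus(Employee e);
}

// Production implementation (assumed formula).
final class DefaultSalaryFunctions implements SalaryFunctions {
    public double computeBonus(Employee e) {
        return e.salary * e.targetBonus;
    }
}

// Hypothetical client: the dependency is injected, never created with "new" here.
final class PayrollService {
    private final SalaryFunctions salaryFunctions;

    PayrollService(SalaryFunctions salaryFunctions) {
        this.salaryFunctions = salaryFunctions;
    }

    double totalCompensation(Employee e) {
        return e.salary + salaryFunctions.computeBonus(e);
    }
}
```

A test can then pass a stub implementation of SalaryFunctions that returns a fixed bonus, without any static mocking library.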
We could go one step further...
Listing 7
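A sketch of this step, assuming that factor is a multiplier applied to the bonus; the text does not say how factor enters the computation, so the formula is illustrative only.

```java
// Hypothetical data-only Employee.
final class Employee {
    public final String name;
    public final double salary;
    public final double targetBonus;

    Employee(String name, double salary, double targetBonus) {
        this.name = name;
        this.salary = salary;
        this.targetBonus = targetBonus;
    }
}

// Function object with one immutable parameter fixed at construction time.
final class SalaryFunctions {
    private final double factor;   // assumed to scale the bonus

    SalaryFunctions(double factor) {
        this.factor = factor;
    }

    // employee is the variable, factor is the parameter (assumed formula).
    double computeBonus(Employee e) {
        return e.salary * e.targetBonus * factor;
    }
}
```

Callers only pass the employee; whoever constructs SalaryFunctions decides the factor once.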
Now factor is an immutable attribute. This is great if, for some parts of the system, the factor value is always the same: the programmer does not have to specify it at every call and, even better, does not have to propagate it to all the callers, which are now independent of this bonus computation detail. Put another way, employee is a variable and factor is a parameter of the function (as x is a variable and A and B are parameters in f(x) = Ax + B).
There are cases where an object (state attributes and methods) is the way to go. But when the intention is to apply functions to data, I think a clean alternative is to split the data from the functions by using a simple data object for the data and stateless objects for the functions.
What do you think?
7 comments:
Felix,
Interesting post. In most cases, gist #2 would be the best solution. You have made the class immutable and self-contained and that's great. Your justification for going further was because there was a mixture of HR and salary related methods and that class might become a mess over time. This is a reasonable concern but if that were to happen, I would propose another technique: creating an interface for HR methods and another one for salary methods and then have Employee implement both. I agree that having a class serve too many masters (HR, salary) is bad but it's really only bad for clients of Employee - not the class itself. It is bad for clients because it can be confusing and dangerous in that they can start calling methods they shouldn't (e.g. changing the salary). It's actually not bad for the Employee class itself which is self-contained and much better for encapsulation of data. This is why the multiple interface does the trick: via the interface, clients only see what they need to see (e.g. HR or salary), the API remains clean, complexity is reduced and encapsulation is maintained.
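As a rough sketch of the multiple-interface idea described in this comment: the interface names, the BenefitsEnrollment client, the fields and the formulas are all hypothetical.

```java
// Narrow view for salary-related clients.
interface SalaryOperations {
    double computeBonus();
}

// Narrow view for HR-related clients.
interface HrOperations {
    boolean isEligibleForBenefits();
}

// Employee stays immutable and self-contained, and implements both views.
final class Employee implements SalaryOperations, HrOperations {
    private final double salary;
    private final double targetBonus;
    private final int yearsOfService;

    Employee(double salary, double targetBonus, int yearsOfService) {
        this.salary = salary;
        this.targetBonus = targetBonus;
        this.yearsOfService = yearsOfService;
    }

    public double computeBonus() { return salary * targetBonus; }

    public boolean isEligibleForBenefits() { return yearsOfService >= 1; }
}

// Hypothetical HR client: it only sees the HR view of an employee.
final class BenefitsEnrollment {
    boolean canEnroll(HrOperations employee) {
        return employee.isEligibleForBenefits();
    }
}
```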
There are cases where I would advocate your final solution, for example, when 2 domain islands need to be joined (too long to explain). Otherwise, it might be overkill.
BTW, I miss having these discussions with you in person!
I like the last version, especially given your description of employee being a variable and the factor being a parameter.
I would be tempted to try a version as a pseudo-singleton (for lack of a better term) with a factory method that takes the factor. It would allow this class to be easily mocked for tests (with methods to install/reset a mock instance) while making it easy to use in client code. But in the end, proper dependency injection would fix what this would be trying to address so this last version is probably best for the general cases.
Good stuff! ;)
Thanks for the comments! This is becoming a very interesting discussion :)
I think I need to add some points. First, yes, in most cases Listing 2 with the immutable employee class would probably be the best option, or at least the option to start with. Even though the final solution might seem overkill, it is interesting to see that it is a valid and testable alternative to static functions. I've started using it with dependency injection to replace/wrap static methods in legacy code and it's been a pleasure so far.
@Nick - My main concern is not the clients of the Employee class, but the whole code (the clients and the Employee class itself). Yes, interfaces could simplify grouping the functions logically from the clients' point of view, but that does not change the fact that the Employee class itself might have too many responsibilities. Encapsulation is a good argument, since exposing the types of all of Employee's fields can cause problems if, for example, a list of objects were to be changed to an array of objects. So maybe using getters on the data class is better. But to take another example, I've seen a UserSession class that got incredibly long and tangled using the standard Java pattern of listings 1 and 2 :P.
Also, the next logical step in the reasoning would be to create an EmployeeFunction class that would have a generic interface with an execute(Employee e, Context c) method. As you know, this pattern is heavily used in frameworks for chaining arbitrary processing units.
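A sketch of that idea. Only EmployeeFunction and execute(Employee e, Context c) come from the text above; the Context contents, the concrete functions, the chain class and the formulas are assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical data-only Employee.
final class Employee {
    public final double salary;
    public final double targetBonus;
    public final int yearsOfService;

    Employee(double salary, double targetBonus, int yearsOfService) {
        this.salary = salary;
        this.targetBonus = targetBonus;
        this.yearsOfService = yearsOfService;
    }
}

// Hypothetical context: a bag of named results shared along the chain.
final class Context {
    private final Map<String, Object> values = new HashMap<String, Object>();

    void put(String key, Object value) { values.put(key, value); }
    Object get(String key) { return values.get(key); }
}

// Generic processing unit.
interface EmployeeFunction {
    void execute(Employee e, Context c);
}

final class ComputeBonus implements EmployeeFunction {
    public void execute(Employee e, Context c) {
        c.put("bonus", e.salary * e.targetBonus);
    }
}

final class CheckBenefits implements EmployeeFunction {
    public void execute(Employee e, Context c) {
        c.put("eligible", e.yearsOfService >= 1);
    }
}

// A chain is itself an EmployeeFunction that runs its steps in order.
final class EmployeeFunctionChain implements EmployeeFunction {
    private final List<EmployeeFunction> steps;

    EmployeeFunctionChain(EmployeeFunction... steps) {
        this.steps = Arrays.asList(steps);
    }

    public void execute(Employee e, Context c) {
        for (EmployeeFunction step : steps) {
            step.execute(e, c);
        }
    }
}
```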
Yeah...I miss those conversations too :)
@Alex Yes, using dependency injection here is the key. Otherwise, I think static methods are better, or the pseudo-singleton factory for that matter.
Felix,
How much do I have to pay you to never mention UserSession again!
In all seriousness, it's usually better to shift complexity, if there is any, onto the implementation rather than the API. The final solution seemed to do the opposite. But it's hard to have this debate based on your abstract example. The right approach would depend on the circumstances and good judgement.
Nick, I'm not sure I understand why you think the last option puts the complexity on the client. Two classes to define functions and two interfaces are very similar. One assumption I make is that the SalaryFunctions object is injected into the client. Otherwise, I agree with you that this is more cumbersome for the client to use.
I think the essence of all this is as you say: The right approach depends on the circumstances and good judgement. My goal was to highlight some other options to add to the programmer's toolkit.
Also, I would add that I still believe applying functions to data is simpler than constructing objects...but the more I try to push this concept in Java programs, the more I hit walls because Java has no support for functions (yet!). Hopefully, Java will evolve and we, the programmers, will have more tools inside the language.
Very interesting discussion! I think you guys covered pretty much anything that could be said about this simple example ;)
While reading all the proposed solutions in the post, there was something in all of them that annoyed me and prevented me from choosing THE right one, but I didn't know what it was. I think I have figured it out now: each solution seems overkill for such a simple example. The fact that you extract these computations into an external class will add complexity to your tests. This is a trivial conclusion of course, because it's only an example, but it brings a different topic to the table: what level of granularity do we want to reach in the modularity of our testable components? Going with fine-grained modularity will probably just make your tests way too complex, with a lot of mock objects. Going with coarse-grained modularity will bind your tests to multiple unrelated components, which is bad. Again, this is all about balance.
In your example, extracting these simple computational functions into a class that will have to be mocked when testing the client code is probably an example of too fine-grained modularity. The client test code simply doesn't have to know that this logic comes from another class. From this perspective, I would probably opt for listing 4, the static method approach.
Y.
Very interesting, Yanick. I agree that most solutions are overkill for this simple example. I think you have a good point worrying about the test code. I do not think that any of these really adds complexity to the testing, but yes, extracting the functions into different classes also increases the number of test classes (but not necessarily the number of test cases).
But at the same time, the simpler the test code, the better. In my experience, stateless code is simpler to test since the test setup does not have to set the state properly first before calling the method to test. For that reason, I think designs like the last few listings are slightly better.
A note about mocking. Yes, breaking the functionality into many classes will increase the number of dependencies and the number of mocks. There is also the option of using the real production classes...doing that or using static methods is the same from the test code perspective.
I think probably the right approach is to go from listing-1 to listing-7 one step at a time when the code becomes painful.
It's all about the balance! But finding that balance is the hardest part!
Thanks for your comment!