Thursday, 14 March 2013

Event-sourced architecture, travelling Java -> Scala (Part 2: Threads, connections)

HTTP server, threads.

When we talk about scalability, we should always start with IO. It has been reworked substantially over the last years, so let's look at the history.

HTTP 1.0

Nobody remembers exactly how it was, just the fact that a new connection between client and server was opened and closed for each call. That meant some overhead, of course.

HTTP 1.1

Persistent connections solved the issue of opening and closing a TCP connection for each request: a connection can now be reused for several request-response exchanges.

On the protocol side, we didn't expect any further improvements until Web Sockets arrived. Some rework was done at the server level instead:

Thread per Connection

Old-generation servers followed this model: the HTTP dispatching module had a thread pool, and when a new connection was established, a thread was picked from the pool and assigned to it. The main issue was scalability: the number of connections was limited by the maximum size of the thread pool. When that limit was reached, the server started to reject new connections. In terms of the JVM, a thread is heavy and is mapped to a native thread. There are many limitations coming from the operating system and JVM settings, and the more threads we have, the more CPU resources we spend on switching between them. Of course, the faster the code runs, the more clients we can support, because threads are returned to the pool sooner. The duration of a single request-response exchange was the main performance lever for developers.
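To make the limit concrete, here is a minimal sketch of the pool exhaustion described above. It is a simulation under stated assumptions, not a real server: there are no sockets, each simulated connection simply holds a pool thread until it is closed, and the names ThreadPerConnectionDemo and countRejected are made up for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical simulation: no real sockets, a "connection" just parks a pool thread.
final class ThreadPerConnectionDemo {

    /** Opens simulated connections against a pool of maxThreads and returns how many were rejected. */
    static int countRejected(int maxThreads, int connections) {
        // SynchronousQueue has no capacity: a task either gets a thread or is rejected.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                maxThreads, maxThreads, 0L, TimeUnit.MILLISECONDS, new SynchronousQueue<>());
        CountDownLatch closeAll = new CountDownLatch(1);
        int rejected = 0;
        for (int i = 0; i < connections; i++) {
            try {
                // The thread stays bound to the connection until the client disconnects.
                pool.execute(() -> {
                    try {
                        closeAll.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            } catch (RejectedExecutionException e) {
                rejected++; // pool exhausted: the server refuses the connection
            }
        }
        closeAll.countDown(); // "close" every connection so the threads are released
        pool.shutdown();
        return rejected;
    }
}
```

With a pool of 4 threads and 10 simulated connections, countRejected(4, 10) reports 6 rejections: every connection beyond the pool size is refused, no matter how idle the accepted ones are.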

Thread per Request  

The Thread per Connection implementation was revisited with the NIO framework. The producer-consumer pattern was applied to IO operations (along with a set of other patterns, of course). NIO made it possible to detach the thread from the persistent HTTP connection: a thread is attached only while a request is being dispatched and is released once the response is ready. While a connection is idle, no thread is blocked on it, so the thread can serve other connections. A connection that is not being served goes to NIO's centralised select set, where it waits for new requests; while it is in that state, it does not occupy any thread.
As in any application of the producer-consumer pattern, we loosen the coupling between the input and the executor, which allows us to tune the input layer separately from the execution layer. On other platforms (Ruby etc.) Thread per Connection is known as the Threaded design and Thread per Request as the Evented design. The idea is the same: we don't dispatch the request in the connection thread.
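The decoupling described above can be sketched with a plain BlockingQueue. This is a simulation under assumptions rather than real NIO code: the producer loop stands in for the selector that notices readable connections, request ids stand in for connections, and the names ThreadPerRequestDemo and serve are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical simulation of the input layer (producer) and execution layer (consumers).
final class ThreadPerRequestDemo {

    /** Serves the given number of simulated requests with a fixed number of workers. */
    static int serve(int requests, int workers) {
        BlockingQueue<Integer> ready = new ArrayBlockingQueue<>(64); // bounded hand-off
        AtomicInteger handled = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(requests);

        // Consumers: a thread is used only while a request is being handled.
        Thread[] pool = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            pool[i] = new Thread(() -> {
                try {
                    while (true) {
                        ready.take(); // wait until some connection has a request
                        handled.incrementAndGet();
                        done.countDown();
                    }
                } catch (InterruptedException e) {
                    // interrupt is used as the shutdown signal
                }
            });
            pool[i].setDaemon(true);
            pool[i].start();
        }

        // Producer: stands in for the selector loop that detects ready connections.
        try {
            for (int id = 0; id < requests; id++) {
                ready.put(id);
            }
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            for (Thread t : pool) {
                t.interrupt();
            }
        }
        return handled.get();
    }
}
```

Here serve(1000, 4) handles a thousand requests with four threads. The queue capacity and the worker count can be tuned independently, which is the point of separating the two layers.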

Whichever way the producer-consumer pattern is implemented, we have to change the way we code. The system should notify us when a request has arrived, and we should notify the system when the response is ready.
This part is easy to grasp and doesn't require much effort. There are many ways to signal that the result is ready, for example accepting a callback function as a parameter or using the Promise pattern. But what does it change if we are still using blocking logic (shared memory, blocking DB requests etc.)? In the end it's exactly the same as before: 1 thread per business execution. We have decoupled the thread count from the connection count only if the business logic doesn't hold on to a system thread, and that isn't the case with the default programming approach.
As we remember from the first post, thread starvation is the possible bottleneck here; at least we have delegated this problem to the next layer.
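As an illustration of the Promise style mentioned above, here is a minimal sketch using CompletableFuture. It assumes Java 8 or later, which is newer than this post; loadGreeting is a hypothetical business call and PromiseDemo is an illustrative name.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class PromiseDemo {

    // Hypothetical business call; in a real system this might block on a DB.
    static String loadGreeting(String user) {
        return "Hello, " + user;
    }

    /** Returns a promise of the response instead of blocking the caller. */
    static CompletableFuture<String> handle(String user, ExecutorService businessPool) {
        return CompletableFuture
                .supplyAsync(() -> loadGreeting(user), businessPool) // runs on the business pool
                .thenApply(body -> "200 OK: " + body);                // fires when the result is ready
    }

    static String demo() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // join() is used here only to observe the result in a demo.
            return handle("Ann", pool).join();
        } finally {
            pool.shutdown();
        }
    }
}
```

Note that the IO side is free while the promise is pending, but if loadGreeting blocks, a thread of the business pool is still occupied for the whole call. That is exactly the "1 thread per business execution" situation.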
Now we have two main problems to solve:
  • the thread limit for business execution indirectly limits the number of clients we can support
  • if the first problem is resolved, we have a possible performance issue: if we use any shared resource, we need transactional behaviour. Until now this was partially handled by our resource layer or DAO, for example with optimistic locking.


Let's start with the topic that unites both problems: concurrency. The JVM came with concurrency support out of the box. Implemented a long time ago, it is based on the Shared State paradigm:
  • Program counter
  • Own stack
  • Shared Memory
  • Locks
Explicit synchronisation is one of the most complex things to implement. It's error-prone, hard to test and even harder to tune because of:
  • manual lock and unlock
  • lock ordering is a big problem
  • locks are not compositional
  • how do we decide what is concurrent?
  • hard to manage any change requests
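The lock-ordering item in the list above can be illustrated with the classic money transfer. This is a minimal sketch with hypothetical names (OrderedAccount, transfer, demoTotal): two threads transfer in opposite directions, and taking the two locks in a fixed global order (by id) is what keeps them from deadlocking.

```java
final class OrderedAccount {
    final int id;
    private long balance;

    OrderedAccount(int id, long balance) {
        this.id = id;
        this.balance = balance;
    }

    long balance() {
        synchronized (this) {
            return balance;
        }
    }

    static void transfer(OrderedAccount from, OrderedAccount to, long amount) {
        // Always lock the account with the smaller id first, whatever the direction.
        // Locking "from" then "to" instead could deadlock with an opposite transfer.
        OrderedAccount first = from.id < to.id ? from : to;
        OrderedAccount second = first == from ? to : from;
        synchronized (first) {
            synchronized (second) {
                from.balance -= amount;
                to.balance += amount;
            }
        }
    }

    /** Runs opposite transfers concurrently and returns the combined balance. */
    static long demoTotal() {
        OrderedAccount a = new OrderedAccount(1, 100);
        OrderedAccount b = new OrderedAccount(2, 100);
        Thread t1 = new Thread(() -> { for (int i = 0; i < 10_000; i++) transfer(a, b, 1); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 10_000; i++) transfer(b, a, 1); });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return a.balance() + b.balance();
    }
}
```

The ordering rule lives in the caller's head rather than in the type system: any new code path that locks two accounts has to know about it. That is what "locks are not compositional" means in practice.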
Plain synchronisation via locking is far from the level of abstraction we usually apply when we meet a complex problem. Several theories and approaches have been researched:
  • Functional Programming
  • Actors
  • CSP
  • CCS
  • petri-nets
  • pi-calculus
  • join-calculus
As I mentioned before, abstraction is our main weapon against complexity. Nowadays several models are heavily used in commercial programming:
  • Transactional memory
  • Actors
  • Dataflow
  • Tuple spaces
We are going to pick 2 of them: transactional memory and Actors.


OK, we know that concurrency design and implementation isn't the easiest topic. We know that there are models that abstract developers away from explicit thread management.
These models help to solve concurrent tasks the same way the Object-Oriented and Functional Programming paradigms help to implement domain solutions.
The JVM has always treated concurrent programming support as a high priority, and since Java 5 it has become much better in terms of the memory model. We got additional rules which made it easier to build multithreaded environments. As for OS and JVM cooperation, the JVM still binds 1 Java thread to 1 native thread (the only exception known to me is Solaris). Native threads are still heavy for the OS: they are designed to support shared memory, and therefore designed with synchronisation, context switching, locking and other heavy operations in mind. The thread count in a real system is limited to a few thousand, and even worse, the CPUs can spend a lot of time on context switching. The main factors limiting the number of threads in a system are:
  • over-provisioning of stacks which leads to quick exhaustion of virtual address space (at least on 32-bit machines)
  • locking mechanisms which lack suitable contention managers
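The stack over-provisioning point in the list above comes down to simple arithmetic. The sketch below uses assumed figures, not measurements: about 2 GiB of usable address space for a 32-bit process and 1 MiB of reserved stack per thread (the real default depends on the JVM, the OS and the -Xss setting). ThreadBudget and maxThreads are illustrative names.

```java
final class ThreadBudget {

    /** Upper bound on the thread count if thread stacks alone consumed the address space. */
    static long maxThreads(long addressSpaceBytes, long stackBytesPerThread) {
        return addressSpaceBytes / stackBytesPerThread;
    }

    public static void main(String[] args) {
        long addressSpace = 2L << 30; // assumption: 2 GiB usable on a 32-bit machine
        long stackSize = 1L << 20;    // assumption: 1 MiB reserved per thread stack
        System.out.println(maxThreads(addressSpace, stackSize)); // prints 2048
    }
}
```

Under these assumptions the ceiling is about two thousand threads before the heap, code and native libraries take their share, which matches the "few thousand" figure above.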

In the meantime, Green Threads were invented. Lightweight processes combine two properties:

  • they can be switched on a processor as cheaply as threads
  • they don't require shared memory, just like native processes

Erlang processes are the best example. They follow the concepts of CSP and CCS.
To start using them, we must design our processes with two things in mind: we must avoid shared memory (all exchange can be implemented via messages), and we must exclude blocking operations inside them (this part is more complex, but still: can we use the NIO style in a DB client?). Let's review lightweight processes in detail. They are not bound to threads; switching between processes is the job of the VM and takes about 16 to 20 processor instructions. Switching between native threads requires a call into the kernel and typically costs on the order of microseconds, and the more threads we have, the longer the switch takes. That is not the case for lightweight processes.
Using lightweight processes would be a great win, but the price is high. A typical application must be redesigned to fit such a model, and that is not always possible. Despite the limitations of lightweight processes, the Concurrency Era pushes us towards new approaches even though they require additional effort. Unfortunately, scaling up hardware is a thing of the past, and we are faced with the requirement to scale out.
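To give a feeling for the model, here is a minimal sketch of message-driven "processes" multiplexed over a small thread pool in plain Java. It is a loose analogy under assumptions, not Erlang's scheduler and not Akka's API: MiniActor, tell and MiniActorDemo are hypothetical names. Each actor owns its state, receives messages through a mailbox and handles them one at a time, so no locks are needed around its state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Hypothetical minimal actor: a mailbox plus a behaviour, with no thread of its own.
final class MiniActor<M> {
    private final Queue<M> mailbox = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean scheduled = new AtomicBoolean(false);
    private final Executor executor;
    private final Consumer<M> behaviour;

    MiniActor(Executor executor, Consumer<M> behaviour) {
        this.executor = executor;
        this.behaviour = behaviour;
    }

    /** Asynchronous send: enqueue the message and make sure the actor is scheduled. */
    void tell(M message) {
        mailbox.add(message);
        trySchedule();
    }

    private void trySchedule() {
        // The flag guarantees that at most one thread drains this mailbox at a time.
        if (!mailbox.isEmpty() && scheduled.compareAndSet(false, true)) {
            executor.execute(this::drain);
        }
    }

    private void drain() {
        try {
            M message;
            while ((message = mailbox.poll()) != null) {
                behaviour.accept(message);
            }
        } finally {
            scheduled.set(false);
            trySchedule(); // a message may have arrived after the last poll
        }
    }
}

final class MiniActorDemo {
    /** Sends messagesEach ones to every actor and returns the sum of all counters. */
    static long run(int actors, int messagesEach) {
        ExecutorService pool = Executors.newFixedThreadPool(4); // 4 threads for all actors
        CountDownLatch done = new CountDownLatch(actors * messagesEach);
        long[] counters = new long[actors]; // slot i is touched only by actor i
        List<MiniActor<Integer>> all = new ArrayList<>();
        for (int i = 0; i < actors; i++) {
            final int slot = i;
            all.add(new MiniActor<Integer>(pool, n -> {
                counters[slot] += n;
                done.countDown();
            }));
        }
        try {
            for (MiniActor<Integer> actor : all) {
                for (int j = 0; j < messagesEach; j++) {
                    actor.tell(1);
                }
            }
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        long total = 0;
        for (long c : counters) {
            total += c;
        }
        return total;
    }
}
```

MiniActorDemo.run(1000, 100) drives a thousand actors with four threads. An idle actor costs only a small object on the heap, but the model holds only as long as the behaviour never blocks and never shares mutable state with another actor.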

In the next post we will review the Akka Actor implementation of the threadless approach.
