Long-running jobs with a jobqueue

By introducing simulations and predictions, we have created a job type that does not coexist well with the interactive website. These jobs must be outsourced to another machine (or a pool of machines) and run as batch jobs, i.e. they don't need user interaction.

Some requirements for this design:

  • Jobs must be self-contained. A job should describe where to get input data (fully qualified URL) and where result data should be published (also fully qualified URL).
  • The lingua franca of the mySmartGrid ecosystem is JSON. Most visualization is done directly using JSON. The (new) data submission API also employs JSON. Therefore, result data (such as a predicted timeseries) must be represented as JSON data and published via the HTTP protocol.
  • The simulations must run on a different system than the webserver.
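The first two requirements can be sketched as a minimal job document. This is only an illustration: the field names (`input_url`, `result_url`) and the input URL are assumptions, not a fixed schema; the result URL shape follows the example further below.

```python
import json

# Hypothetical job document. The requirements only demand two fully
# qualified URLs: where to fetch input data, where to publish the result.
job = {
    "input_url": "http://api.mysmartgrid.de/sensor/1234/values",  # assumed endpoint
    "result_url": "http://worker1.mysmartgrid.de/550e8400-e29b-41d4-a716-446655440000",
}

body = json.dumps(job)      # this is what would be put on the queue
decoded = json.loads(body)  # this is what a worker would see

# Both URLs must be fully qualified.
assert all(u.startswith("http://") for u in decoded.values())
```

Because the job carries both URLs, it is self-contained: any worker that picks it up knows where to read from and where to write to, without further coordination with the webserver.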

Beanstalkd

Beanstalkd is a simple job queue designed for minimal overhead: http://kr.github.com/beanstalkd/

The beanstalk protocol is very simple: https://github.com/kr/beanstalkd/blob/v1.3/doc/protocol.txt. This is a Very Good Thing(TM), because it has led to a bunch of client implementations that are ready to use.
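To illustrate just how simple the wire format is, here is a sketch that frames a `put` command as the linked protocol document specifies (`put <pri> <delay> <ttr> <bytes>\r\n<data>\r\n`). The default priority, delay, and TTR values are arbitrary choices for the example:

```python
def frame_put(data: bytes, pri: int = 1024, delay: int = 0, ttr: int = 60) -> bytes:
    # Wire format: put <pri> <delay> <ttr> <bytes>\r\n<data>\r\n
    # pri:   job priority (lower runs first)
    # delay: seconds before the job becomes ready
    # ttr:   seconds a worker may hold the job before it is released again
    header = f"put {pri} {delay} {ttr} {len(data)}\r\n".encode("ascii")
    return header + data + b"\r\n"

framed = frame_put(b'{"job": 1}')
# -> b'put 1024 0 60 10\r\n{"job": 1}\r\n'
```

A real deployment would of course use one of the existing client libraries rather than hand-rolling the framing; the point is that a complete client fits in a few dozen lines.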

The architecture for mySmartGrid will look as follows:

(Figure: super-sketchy sketch of the architecture)

The architecture consists of a webserver, several worker nodes and a beanstalkd instance. The webserver submits jobs to the beanstalkd queue. A job is a JSON-formatted document containing the desired result URL and $foo as input data for the job. A worker node then retrieves the job from the queue and dispatches a process locally. As soon as the process finishes, the job is deleted from the queue. The results of the job are published using the worker's local HTTP server, again JSON-formatted. In order to be able to detect outdated information, a timestamp MUST be included in the result file.
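The mandatory timestamp can be added when the worker writes the result file. A minimal sketch, assuming the worker writes straight into its HTTP server's document root; the field names `timestamp` and `values` are made up for illustration:

```python
import json
import time

def write_result(series, path):
    # Result file as served by the worker's local HTTP server:
    # JSON-formatted, with a mandatory timestamp so that consumers
    # can detect outdated information.
    doc = {"timestamp": int(time.time()), "values": series}
    with open(path, "w") as f:
        json.dump(doc, f)
    return doc

# Example: publish a predicted timeseries under an (assumed) result path.
doc = write_result([1.2, 1.4, 1.1], "/tmp/550e8400-result.json")
```

A consumer can then compare the timestamp against its own clock (or against the timestamp of a previously fetched result) and discard stale data.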

Right now, there is no redundancy in the system. I assume that the webserver decides which worker node should run the job, so the results are accessible at a URL like http://worker1.mysmartgrid.de/<UUID>. This makes it easy for the webserver to render pages that include a link to the data, which the users' web browsers can then access seamlessly.
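Generating such a per-job result URL is straightforward; a sketch, with the host name taken from the example above and a made-up helper name:

```python
import uuid

def make_result_url(worker_host: str) -> str:
    # One random UUID per job, e.g. http://worker1.mysmartgrid.de/<UUID>.
    # uuid4() makes collisions between jobs practically impossible.
    return f"http://{worker_host}/{uuid.uuid4()}"

url = make_result_url("worker1.mysmartgrid.de")
```

The webserver would generate this URL when it creates the job, embed it in the job document, and render it into the page immediately, without waiting for the worker to finish.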

The disadvantage of this solution is that there is one queue per worker. If a worker fails, the data already published on that worker is no longer available, and new jobs assigned to it are not processed. This can be mitigated in future releases:

  1. A load balancer makes a group of worker nodes accessible. All results are available under the base URL of the load balancer.
  2. Worker nodes work on a common queue.
  3. The results are rsync'd across all worker nodes.
  4. If one worker fails, the load balancer stops using it. The other workers pick up the additional jobs.

In the long term, this setup should be much more stable. Right now, however, it seems to be overkill. Note for the future: Check out Varnish.

jobqueue.1310477704.txt.gz · Last modified: 2012/10/30 10:34 (external edit)