Monday, August 18, 2014

A NetflixOSS sidecar in support of non-Java services

In working on supporting our next round of IBM Cloud Service Fabric service tenants, we found that the service implementers came from very different backgrounds.  Some were skilled in Java, some Ruby and others were C/C++ focused and therefore their service implementations were just as diverse.  Given the size of the team of the services we're on-boarding and timeframe for going public, recoding all of these services to use NetflixOSS Java libraries that bring the operational excellence (like Archaius, Karyon, Eureka, etc) seemed pretty unlikely.

For what it is worth, we faced a similar challenge in earlier services (mostly due to existing C/C++ applications) and we created what was called a "sidecar".  By sidecar, what I mean is a second process on each node/instance that did Cloud Service Fabric operations on behalf of the main process (the side-managed process).  Unfortunately those sidecars all went off and created one-offs for their particular service.  In this post, I'll describe a more general sidecar that doesn't force users to have these one-offs.

Sidenote:  For those not familiar with sidecars, think of the motorcycle sidecar below.  Snoopy would be the main process with Woodstock being the sidecar process.  The main work on the instance would be the motorcycle (say serving your users' REST requests).  The operational control is the sidecar (say serving health checks and management plane requests of the operational platform).


Before we get started, we need to note there are multiple types of sidecars.  Predominantly there are two main types of sidecars.  There are sidecars that manage durable and or storage tiers.  These sidecars need to manage things that other sidecars do not (like joining a stateful ring of servers, or joining a set of slaves and discovering masters, or backup and recovery of data).  Some sidecars that exist in this space are Priam (for Cassandra) and Exhibitor (for Zookeeper).  The other type is for managing stateless mid-tier services like microservices.  An example of this is AirBNB's Synapse and Nerve.  You'll see that in the announcement of Synapse and Nerve on AirBNB's blog that they are trying to solve some (but not all) of the issues I will mention in this blog post.

What are some things that a microservice sidecar could do for a microservice?

1. Service discovery registration and heartbeat

This registration with service discovery would have to happen only after the sidecar detects the side-managed process as ready to receive requests.  This isn't necessarily the same as if the instance is "healthy" as an instance might be healthy well before it is ready to handle requests (consider an instance that needs to pre-warm caches, etc.).  Also, all dynamic configuration of this function (where and if to register) should be considered.

2.  Health check URL

Every instance should have a health check url that can communicate out of band the health of an instance.  The sidecar would need to query the health of the side-managed process and expose this url on behalf of the side-managed process.  Various systems (like auto scaling groups, front end load balancers, and service discovery queries) would query this URL and take sick instances out of rotation.

3.  Service dependency load balancing

In a NetflixOSS based microservice, routing can be done intelligently based upon information from service discovery (Eureka) via smart client side load balancing (Ribbon).  Once you move this function out of the microservice implementation, as AirBNB noted as well, it is likely unneeded and problematic in some cases to move back to centralized load balancing.  Therefore it would be nice if the sidecar would perform load balancing on behalf of the side-managed process.  Note that Zuul (on instance in the sidecar) could fill this role in NetflixOSS.  In AirBNB's stack, the combination of service discovery and this item is done through Synapse.  Also, all dynamic configuration of this function (states of routes, timeouts, retry strategy, etc) should be considered.

One other area to consider here (especially in the NetflixOSS space) would be if the sidecar should provide for advanced devops filters in load balancing that go beyond basic round robin load balancing.  Netflix has talked about the advantages of Zuul for this in the front/edge tier, but we could consider doing something in between microservices.

4.  Microservice latency/health metrics

Being able to have operational visibility into the error rates on calls to dependent services as well as latency and overall state of dependencies is important to knowing how to operate the side-managed process.  In NetflixOSS by using the Hystrix pattern and API, you can get such visibility through the exported Hystrix streams.  Again, Zuul (on instance in the sidecar) can provide this functionality.

5.  Eureka discovery

We have found service implementation in IBM that already have their own client side load balancing or cluster technologies.  Also, Netflix has talked about other OSS systems such as Elastic Search.  For these systems it would be nice if the sidecar could provide a way to expose Eureka discovery outside of load balancing.  Then the client could ingest the discovery information and use it however it felt necessary.  Also, all dynamic configuration of this function should be considered.

6.  Dynamic configuration management

It would nice if the sidecar could expose to the side-managed process dynamic configuration.  While I have mentioned the need to have previous sidecar functions items dynamically configured, it is important that the side-managed process configuration to be considered as well.  Consider the case where you want the side-managed process to use a common dynamic configuration management system but all it can do is read from property files.  In NetflixOSS this is managed via Archaius but this requires using the NetflixOSS libraries.

7.  Circuit breaking for fault tolerance to dependencies

It would nice if the sidecar could provide an approximation of circuit breaking.  I believe this is impossible to do as "cleanly" as using NetflixOSS Hystrix natively (as this wouldn't require the user to write specific business logic to handle failures that reduce calls to the dependency), but it might be nice to have some level of guarantee of fast failure of scenarios using #3.  Also, all dynamic configuration of this function (timeouts, etc) should be considered.

8.  Application level metrics

It would be nice if the sidecar provided could allow the side-managed process to more easily publish application specific metrics to the metrics pipeline.  While every language likely already has a nice binding to systems like statsd/collectd, it might be worth making the interface to these systems common through the sidecar.  For NetflixOSS, this is done through Servo.

9. Manual GUI and programmatic control

We have found the need to sometimes quickly dive into a specific instance with human eyes.  Having a private web based UI is far easier than loading up ssh.  Also, if you want to script access to the functions and data collected by the sidecar, we would like a REST or even JMX interface to the control offered in the sidecar.

This all said, I started a quick project last week to create a sidecar that does some of these functions using NetflixOSS so it integrated cleanly into our existing IBM Cloud Services Fabric environment.  I decided to do it in github, so others can contribute.

By using Karyon as a base for the sidecar, I was able to get a few of the items on the list automatically (specifically #1, #2 partially and #9).  I started with the most basic sidecar in the trunk project.  Then I added two more things:


Consul style health checks:


In work leading up to this work Spencer Gibb pointed me to the sidecar agents checks that Consul uses (which they said they based on Nagios).  I based a similar set of checks for my sidecar.  You can see in this archaius config file how you'd configure them:

com.ibm.ibmcsf.sidecar.externalhealthcheck.enabled=true
com.ibm.ibmcsf.sidecar.externalhealthcheck.numchecks=1

com.ibm.ibmcsf.sidecar.externalhealthcheck.1.id=local-ping-healthcheckurl
com.ibm.ibmcsf.sidecar.externalhealthcheck.1.description=Runs a script that curls the healthcheck url of the sidemanaged process
com.ibm.ibmcsf.sidecar.externalhealthcheck.1.interval=10000
com.ibm.ibmcsf.sidecar.externalhealthcheck.1.script=/opt/sidecars/curllocalhost.sh 8080 /
com.ibm.ibmcsf.sidecar.externalhealthcheck.1.workingdir=/tmp

com.ibm.ibmcsf.sidecar.externalhealthcheck.2.id=local-killswitch
com.ibm.ibmcsf.sidecar.externalhealthcheck.2.description=Runs a script that tests if /opt/sidecarscripts/killswitch.txt exists
com.ibm.ibmcsf.sidecar.externalhealthcheck.2.interval=30000
com.ibm.ibmcsf.sidecar.externalhealthcheck.2.script=/opt/sidecars/checkKillswitch.sh

Specifically you define a check as an external script that the sidecar executes and if the script returns a code of 0, the check is marked as healthy (1 = warning, otherwise unhealthy).  If all checks defined come back as healthy for greater than three iterations, the instance is healthy.  I have coded up some basic shell scripts that we'll likely give to all of our users (like curllocalhost.sh and checkkillswitchtxtfile.sh).  Once I had these checks being executed by the sidecar, it was pretty easy to change the Karyon/Eureka HealthCheckHandler class to query the CheckManager logic I added.


Integration with Dynamic Configuration Management


We believe most languages can easily register events based on files changing and can easily read properties files.  Based on this, I added another feature configured this archiaus config file:

com.ibm.ibmcsf.sidecar.dynamicpropertywriter.enabled=true
com.ibm.ibmcsf.sidecar.dynamicpropertywriter.file.template=/opt/sidecars/appspecific.properties.template
com.ibm.ibmcsf.sidecar.dynamicpropertywriter.file=/opt/sidecars/appspecific.properties

What this says is that a user of the sidecar puts all of the properties they care about in the file.template properties file and then as configuration is dynamically updated in Archaius the sidecar sees this and writes out a copy to the main properties file with the values filled in.

With these changes, I think we now have a pretty solid story for #1, #2, #6 and #9.  I'd like to next focus on #3, #4, and #7 adding a Zuul and Hystrix based sidecar process but I don't have users (yet) pushing for these functions.  Also, I should note that the code is a proof of concept and needs to be hardened as it was just a side project for me.

PS.  I do want to make it clear that while this sidecar approach could be used for Java services (as opposed to languages that don't have NetflixOSS bindings), I do not advocate moving these functions to external to your Java implementation.  There are places where offering this function in a side-car isn't as "excellent" operationally and more close to "good enough".  I'll let it to the reader to understand these tradeoffs.  However, I hope that work in this microservice sidecar space leads to easier NetflixOSS adoption in non-Java environments.

PPS.  This sidecar might be more useful in the container space as well at a host level.  Taking the sidecar and making it work across multiple single process instances on a host would be an interesting extension of this work.