Joe Sondow's Blog

Sunday, November 11, 2012

State Is a Bug

The story so far

For those listening in on this public conversation, here's the background.

I participated in a podcast episode about Continuous Deployment at the Java Posse Roundup 2012.

Marc Esher tweeted about the podcast, asking me to elaborate further on my assertion that "state is a bug".

That led to a longer question on a GitHub gist:

Joe, thanks for responding.

I'm most interested in what are perhaps pedestrian issues, but they're issues (for me) nonetheless. You mentioned "outsourcing" session management... where can I read more on that?

Take the simple case of: I'm a user on your system. I'm logged in. I'm doing things. The server I'm on disappears while the screen I'm reading is currently loading.

What happens? Do I see an error? Am I sent to a new machine, with all my session state intact, and the screen I was reading simply reloads?

Or take perhaps a different architecture where a machine isn't brought down until all its users have been successfully moved to other servers. For example, perhaps our configuration is such that we have N instances, and we do a rolling code push to those instances. User A is on Server 5, which has the old code. We need to get that user to Server 2, which has the new code, along with all of his state. Once all users are off of Server 5, the deploy then moves to Server 5 as well and then Server 5 is brought back into commission and accepts new requests.

So: what techniques, processes, and tools support these practices?

Thanks again.

My response got a bit long for a gist, so here it is as a blog post. Marc and I haven't met. This is just how the internet works now.

My response

Hi Marc,

Here's how I see it.

Instantaneous failure

As for the question of instantaneous failure, the server shouldn't be expected to disappear while it's handling a user's request and delivering a quick response, unless the instance truly has a catastrophic power failure in that moment. In that case there's no way I know of to recover. However, I'm not really concerned with that case because I think it's sufficiently rare, not currently practical to protect against, and not significant yet, because clients still occasionally expect to do three retries before succeeding. During an electrical storm or "Take your monkey to work"-day, servers can end up getting unplugged, and clients will be affected if their requests arrive at just the wrong instant, but I don't think that problem has been solved well yet. The solvable problem is this: when the user retries and gets a healthy server's response, will the user have an appropriate experience, or will they lose all their work and need to redo all their recent actions?

So, for the duration of one request and response in a stateless protocol like HTTP, the client and server are dependent on each other, and on the internet hops between them. Keep that request-response duration short enough, and the risk of instantaneous power loss should be mitigated.
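The client-side retry mentioned above can be sketched roughly as follows. This is only an illustration: RetryingClient, callWithRetries, and the choice of three attempts are hypothetical names and values, and the sketch assumes that something like a load balancer sends the next attempt to a healthy server.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch: retry a short request a few times so that one
// vanished server does not immediately surface as an error to the user.
final class RetryingClient {

    // Runs the request up to maxAttempts times and returns the first success.
    // If every attempt fails, the last failure is rethrown to the caller.
    static <T> T callWithRetries(Callable<T> request, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.call();
            } catch (Exception e) {
                // Assumption: the next attempt is routed to a different, healthy server.
                lastFailure = e;
            }
        }
        throw lastFailure;
    }
}
```

A retry like this only gives the user a good experience if the server that answers the retry can see the same state as the one that vanished, which is the subject of the next sections.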

If an instance gets a software shut-down signal, it ought to be configured to shut down its web server process gracefully before permitting the operating system to shut down. Web server graceful shut down should mean refusing to accept new requests, while finishing the delivery of responses the server has already started. Therefore, "terminate instance" should usually entail automatic draining of connections before total shut down occurs.
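As a rough sketch of what "refuse new requests, finish the ones in progress" means, here is a toy example built on a plain Java thread pool. DrainingServer and its methods are hypothetical names, not part of any real web server; a real server would hook equivalent logic into its own shutdown sequence.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of connection draining using a standard thread pool.
final class DrainingServer {
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    // Accepts a request while the server is running. Once shutdown has begun,
    // the pool rejects new work with a RejectedExecutionException.
    Future<String> handle(Callable<String> request) {
        return workers.submit(request);
    }

    // Stops accepting new requests, then waits for in-flight requests to finish.
    // Returns true if everything drained before the timeout elapsed.
    boolean shutdownGracefully(long timeout, TimeUnit unit) throws InterruptedException {
        workers.shutdown(); // already submitted requests still run to completion
        return workers.awaitTermination(timeout, unit);
    }
}
```

The timeout matters: if responses are kept short, as argued above, the drain window can be short as well.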

Session state

The state that I consider a bug in any highly available system is session state, not request state. Anything you want to store in session could instead be stored in a remote shared state service such as a database, a queue, or a workflow service. Consider a case for a mature shopping site like Amazon.com or eBay.com. If you log in to your account with two laptops, and start viewing items on both laptops, the history of what you've recently viewed is generally visible to both sessions around the same time, even though you are collecting history in two different sessions, possibly on two different servers. This means that the important shopping transactions of "view item" are stored in a database shared by the two web servers your clients are talking to. If one of those servers shuts down and both of your laptops continue sending shopping requests to view more items, then some of the client traffic should get switched by a load balancer to a healthy server for the future requests. The state of the user's shopping sessions is undamaged because it is still in the shared database, visible to all server instances.
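Here is a minimal sketch of the shopping example, assuming an in-memory map as a stand-in for the shared database so the code stays runnable. SharedViewHistory and StatelessWebServer are hypothetical names invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Stand-in for a remote shared service such as a database.
// Assumption: in a real system this would live outside every web server.
final class SharedViewHistory {
    private final Map<String, List<String>> viewedByUser = new ConcurrentHashMap<>();

    void recordView(String userId, String itemId) {
        viewedByUser.computeIfAbsent(userId, id -> new CopyOnWriteArrayList<>()).add(itemId);
    }

    // Returns a copy so callers cannot modify the shared history directly.
    List<String> recentViews(String userId) {
        List<String> views = viewedByUser.get(userId);
        return views == null ? new ArrayList<>() : new ArrayList<>(views);
    }
}

// A web server with no per-user fields of its own: any instance can serve any user.
final class StatelessWebServer {
    private final SharedViewHistory history;

    StatelessWebServer(SharedViewHistory history) {
        this.history = history;
    }

    // Records the "view item" transaction and returns the user's history so far.
    List<String> viewItem(String userId, String itemId) {
        history.recordView(userId, itemId);
        return history.recentViews(userId);
    }
}
```

Because neither server holds the history itself, discarding one StatelessWebServer and routing the user to another loses nothing.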

Once you move "session"-type state out of the server's session context and into a remote shared service, it becomes practical to set up auto-scaling policies to get more stateless virtual servers when you need them, and to terminate stateless virtual servers when your traffic levels drop, in order to save money when renting virtual servers from a cloud provider.

I think about this stuff a lot because I work on Asgard, the open source app deployment and cloud management app produced and used by Netflix. I talk to a lot of Netflix engineers about the need to get state out of their applications. I'm also working to get state out of Asgard itself.

Rehearsing for failure

Many Netflix services use AWS auto scaling for availability and cost savings, so they need to keep user state out of their service so servers can be used interchangeably by clients. Netflix also uses Chaos Monkey to terminate instances within an Auto Scaling Group daily during business hours just to make sure the developers are still maintaining a system that is resilient to small-scale instance failures. Amazon and all other data centers have server failures, so our best defense is to expect failures and to plan for them, and to practice automatic recovery all the time. Netflix doesn't use auto scaling for Cassandra database rings because Cassandra doesn't have a good way to handle frequent growth in the size of a cluster. However, we do use Chaos Monkey to exercise Cassandra's ability to recover completely from an occasional single instance termination.

We also have all our services in three AWS Availability Zones (data centers) so if one zone has a major problem, Netflix is generally unaffected while Reddit and Pinterest are sometimes down for hours.

Asgard's state problem

On a more personal level, I'm working to remove state from Asgard so that I can increase Asgard's server count and release new versions of Asgard for use by Netflix engineers more frequently and conveniently. My intent is to use Amazon Simple Workflow Service to store the state of each long-running automation process that gets started by an Asgard user. Examples include a complex rolling push of new code to replacement instances in an auto scaling group, or a reversible push of new code to a new auto scaling group, with automated result checking and rollback on failure. The state of the long-running workflow execution will then be visible to all Asgard instances, while none of those instances needs to stay up just to finish the automation process.
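The idea can be sketched without any real workflow service. The code below does not use the Amazon Simple Workflow Service API; WorkflowStateStore and RollingPushWorker are hypothetical names, and an in-memory map stands in for the external service. It only shows the property I'm after: any worker can pick up a rolling push where another one stopped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for an external workflow state service (not the SWF API).
final class WorkflowStateStore {
    private final Map<String, Integer> nextStepByWorkflow = new ConcurrentHashMap<>();

    // Index of the next step to perform; a workflow that never ran starts at 0.
    int nextStep(String workflowId) {
        return nextStepByWorkflow.getOrDefault(workflowId, 0);
    }

    void markStepDone(String workflowId, int step) {
        nextStepByWorkflow.put(workflowId, step + 1);
    }
}

// A worker that keeps no progress of its own; progress lives only in the store.
final class RollingPushWorker {
    private final WorkflowStateStore store;

    RollingPushWorker(WorkflowStateStore store) {
        this.store = store;
    }

    // Handles at most maxSteps instances, starting from the recorded position,
    // and returns the instance IDs handled during this call.
    List<String> run(String workflowId, List<String> instanceIds, int maxSteps) {
        List<String> handled = new ArrayList<>();
        int step = store.nextStep(workflowId);
        while (step < instanceIds.size() && handled.size() < maxSteps) {
            handled.add(instanceIds.get(step)); // placeholder for the real replace-instance work
            store.markStepDone(workflowId, step);
            step = store.nextStep(workflowId);
        }
        return handled;
    }
}
```

If the first worker's server is terminated after one step, a second worker on another server continues from step two instead of starting over.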

Tuesday, February 7, 2012

Ye Olde Tragic Journey of Attempting to Upgrade to Grails 2.0.0

It appears that Grails 2.0.0 is not yet ready for the large, pre-existing Grails 1.3.7 application my team works on at Netflix.

First a little background. I work on Asgard, formerly known as the Netflix Application Console (NAC). Here's a slide deck and a video about it. Asgard is a Grails-based web app used internally by Netflix to manage cloud systems and deployments in the Amazon Web Services cloud. If everything goes as planned I will be open sourcing Asgard under the Apache license on Netflix's GitHub space later in 2012. The application has been under constant development and in general use within Netflix since early 2010. Any time Asgard has a major problem, engineers at Netflix cannot deploy their changes to production and cannot run experiments in our test environments. If you've streamed a Netflix video in the past year, the servers that show you the user interface components, grant you access to stream the video, and store your viewing history and ratings were all deployed and upgraded repeatedly using Asgard. (The video file itself comes from a Content Delivery Network (CDN), but let's not get into that.)

I want to upgrade Grails. Here are a few reasons. The new error page shows the code where the error occurred. Taglibs can use GSPs instead of Groovy strings for templating. The build system provides easy overriding of the ivy repository location, and provides hooks to remove Asgard's dependence on the Netflix build system so I can open source Asgard. Testing annotations instead of inheritance makes for better Spock tests. Plugins can be packaged as jars and retrieved from an Artifactory repository instead of a directory filled with files. More classes are supposed to be redeployed to a running server during development iterations. It all sounds great.

I strive for the cleanest user experience I can achieve. That includes the visual and REST API of the application for users, as well as the quality of the code for my fellow developers, and the email messages and logs from the server. Let me say that last part again. The logs from the server. The logs are a crucial part of the interface between my team and the Grails framework. If the signal-to-noise ratio in the logs is too low, then my team becomes dangerously trained to ignore warnings and errors in their own IDE. I consider that risk to be well worth avoiding.

In trying to upgrade to Grails 2.0.0 I encountered a number of problems. For each problem I spent time investigating the details and looking for solutions. This blog entry shows the issues that I experienced. I've divided them into a few good changes that help me find mistakes in my application, some annoying problems that are endurable for the short term, and significant regressions that make Grails 2.0.0 substantially worse for me than Grails 1.3.7.

Good

Compilation errors in GSPs

Solution:
Remove copy-pasted code that probably never would have worked anyway.

TLD start up messages are gone from log

Grails 1.3.7 always wrote this in the start up log on my local machine and I was never able to configure the logging early enough in the start up process to eliminate these messages.

2012-02-07 10:25:39,605 [main] INFO digester.Digester - TLD skipped. URI: http://www.springframework.org/tags is already defined

2012-02-07 10:25:39,629 [main] INFO digester.Digester - TLD skipped. URI: http://java.sun.com/jsp/jstl/core is already defined

2012-02-07 10:25:39,648 [main] INFO digester.Digester - TLD skipped. URI: http://java.sun.com/jsp/jstl/fmt is already defined

Grails 2.0.0 no longer shows those useless messages, so that's good.

Annoying

Subclassing AbstractList causes error

During command line unit test execution and at application runtime, this error occurs on a class that extends AbstractList.

[exec] | Error Compilation error compiling [unit] tests: (class: com/netflix/nac/push/Cluster, method: super$1$clearErrors signature: ()V) Illegal use of nonvirtual function call (Use --stacktrace to see the full trace)

I tried running unit tests in the IDE in an attempt to debug the issue, but that resulted in unit tests hanging without easily uncovered explanations, so I'll skip over the detailed reasons why this is happening and just accept that I can't subclass AbstractList. It wasn't strictly necessary for my use case. It was just a nice-to-have. Not being able to subclass AbstractList in a Groovy class is a bummer and probably a regression but not a deal breaker for me.

Solution:
Don't subclass AbstractList anymore. This might be a deal breaker for other applications.

Spock plugin for Grails 2.0.0 has no stable release yet

http://grails.org/plugin/spock shows that Spock tests in Grails 2.0.0 require a SNAPSHOT version of the Spock plugin. This means that each time my project builds it could get different Spock framework code. It also means that the maintainers of Spock do not yet regard the new plugin as being fully ready for release. My project has enough Spock unit tests that this is a pretty serious trade off.

Solution:
None found.

Immutable annotation errors

I knew this one from online discussions and conferences. Grails 1.3.7 uses an older Groovy version where the groovy.lang.Immutable annotation is the only Immutable annotation available, and it's deprecated because it's a bit buggy. Grails 2.0.0 uses a later version of Groovy where the default groovy.lang.Immutable is deprecated and the improved annotation is groovy.transform.Immutable. Unfortunately there is no trivial way to make the deprecated Immutable be non-default or illegal, so everyone on the team needs to be aware of this risk for every new Immutable class. Maybe my team will eventually make a CodeNarc rule to outlaw references to the default Immutable class.

Solution:
Add this line to every source file that mentions @Immutable

import groovy.transform.Immutable

Checking for equality between a null Integer and an integer literal now throws an exception instead of returning false

Integer responseCode = checkResponse()
if (responseCode == 200) { // If responseCode is null this throws an exception in Groovy 1.8.4 in Grails 2.0.0
    proceed()
} else {
    flash.message = "Error: the response code was ${responseCode}"
}

This other blog suggests that this might be a problem with checking for equality with null values for multiple cases other than Integer. If that turns out to be the case for anything in my large application then I will have to consider this a regression and definitely not a deliberate change to Groovy. I thought maybe this was a side effect of Groovy 1.8.4's increased compiler strictness but the other blog suggests that it might just be a bug in either Groovy 1.8.4 or Grails 2.0.0. This might be a show stopper, since I often need to compare values to null variables.

Solution:
Test every case in the application and replace all the Integer variables with int variables. Refactor all cases where null used to be a useful sentinel value, and come up with a different sentinel value like -1, but write more code to check for the sentinel value and convert it to a better string like "null" or "missing" or "failed" for the resulting error strings.

int responseCode = checkResponse()
if (responseCode == 200) {
    proceed()
} else {
    flash.message = "Error: the response code was ${responseCode == -1 ? 'null' : responseCode }"
}

This isn't a good solution but it might be acceptable.

Interactive mode is the default

This may be nice when building new apps on the command line, but it's mostly a productivity hurdle when I have automated builds that need an extra start up parameter, and a dozen IDE start up configuration buttons on each developer machine, all of which need a new start up parameter. Adding a start up parameter might sound like a little thing, but actually this upgrade has required 6 new start up parameters so far just to make things work like they used to. See the end of this blog post for details. That's a lot of new cruft to manage for all of Netflix's Grails developers and builds.

--non-interactive
-server
-javaagent:/Users/jsondow/w1/Tools/groovy/grails-2.0.0/lib/com.springsource.springloaded/springloaded-core/jars/springloaded-core-1.0.2.jar
-noverify
-Dspringloaded=profile=grails
-Dgrails.log.deprecated=false

Solution:
Repetitively add the --non-interactive command line parameter to all IntelliJ configurations and all ant targets

Repeated statements in start up log

The start-up log in IntelliJ is now needlessly repetitive, hurting the signal-to-noise ratio of the log. This impacts my ability to know at a glance if a legitimate new problem has occurred.
Four distinct statements now take sixteen lines each time I start my app from IntelliJ which I sometimes need to do many times a day. Adding animated dots to a log string to show the passage of time for newbie command line users should not be worth the trade off of adding garbage to the start up log for long-time developers of mature applications.

| Configuring classpath
| Configuring classpath.
| Environment set to development
| Environment set to development.
| Environment set to development..
| Environment set to development...
| Environment set to development....
| Environment set to development.....
| Packaging Grails application
| Packaging Grails application.
| Packaging Grails application..
| Packaging Grails application...
Config: development environment
| Packaging Grails application....
| Packaging Grails application.....
Config: development environment

A little research shows the naive solution to be adding the --plain-output command line parameter to the command line string (NOT to the VM parameters string) in the IntelliJ config. That makes the logs shorter and more reasonable. However, doing this breaks IntelliJ's ability to launch a browser with the localhost:8080 home page when ready. I've set up my home page to trigger the loading of all other asynchronous cache loading service initializations, so starting up this way would require a click-wait-copy-paste-browse-wait manual procedure for each server restart in development mode. This is even worse than a noisy log, so this solution doesn't work for me.

Solution:
None found.

Regressions

printf now adds extra newline characters incorrectly between string tokens and variable tokens

printf(" Cached %5d '%s'\n", map.size(), name)

That's my one use of printf. It's great for keeping my cache loading log messages vertically lined up. The output used to look like this:

Cached 587 'us-east-1 Security Groups'
Cached 32 'us-east-1 DB Snapshots'
Cached 38 'us-east-1 DB Instances'

Now that same line of code results in this output:

Cached
587
'
us-east-1 Security Groups
'
Cached
32
'
us-east-1 DB Snapshots
'
Cached
38
'
us-east-1 DB Instances
'

Solution:
Don't use printf anymore. Use String.format() separately for each variable that needs padding. Stick the result into a GString. This bug is not a deal breaker for me but it might be for some applications that use printf more extensively.

println(" Cached ${String.format('%5d', map.size())} '${name}'")

Using a mixin causes infinite recursive calls

Using a mixin on a Java library class now causes an infinite recursive call, resulting in a StackOverflowError each time the server starts.

| Error 2012-02-06 16:37:27,155 ["http-bio-8080"-exec-10] ERROR errors.GrailsExceptionResolver - StackOverflowError occurred when processing request: [GET] /us-east-1/autoScaling/list
Stacktrace follows:
Message: Executing action [list] of controller [com.netflix.nac.AutoScalingController] caused exception: Runtime error executing action
Line | Method
->> 886 | runTask in java.util.concurrent.ThreadPoolExecutor$Worker
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| 908 | run in ''
^ 680 | run . . in java.lang.Thread

Caused by ControllerExecutionException: Runtime error executing action
->> 886 | runTask in java.util.concurrent.ThreadPoolExecutor$Worker
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| 908 | run in ''
^ 680 | run . . in java.lang.Thread

Caused by InvocationTargetException: null
->> 886 | runTask in java.util.concurrent.ThreadPoolExecutor$Worker
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| 908 | run in ''
^ 680 | run . . in java.lang.Thread

Caused by StackOverflowError: null
->> 60 | get in org.codehaus.groovy.util.AbstractConcurrentMap$Segment
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
^ 30 | get in org.codehaus.groovy.util.AbstractConcurrentMap

org.codehaus.groovy.grails.web.errors.GrailsWrappedRuntimeException
Caused by: org.codehaus.groovy.grails.web.servlet.mvc.exceptions.ControllerExecutionException: Executing action [list] of controller [com.netflix.nac.AutoScalingController] caused exception: Runtime error executing action
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:680)
Caused by: org.codehaus.groovy.grails.web.servlet.mvc.exceptions.ControllerExecutionException: Runtime error executing action
... 3 more
Caused by: java.lang.reflect.InvocationTargetException
... 3 more
Caused by: java.lang.StackOverflowError
at org.codehaus.groovy.util.AbstractConcurrentMap$Segment.get(AbstractConcurrentMap.java:60)
at org.codehaus.groovy.util.AbstractConcurrentMap.get(AbstractConcurrentMap.java:30)

The error message is entirely unhelpful. Debugging eventually pointed to a mixin class with no obvious problems my team could discern.

Solution:
For now, remove all use of mixins and go back to a clunkier metaclass syntax for monkey patching classes. This is a large step backward. Maybe there is something else going on, but this would fix the problem if I chose to proceed with the upgrade right now.

Out of memory exception during build

I habitually avoid using any plugins I can live without, for the sake of easier upgrades and reduced detective work in the face of memory problems and mysterious errors. However, other contributors to the application have added some plugins they wanted to try, and the use of those plugins never got finished enough to justify their continued presence.

Solution:
Delete all plugins and references to them for now. When I need to add some back I'll increase the memory allocated to the build process. This isn't a suggestion for other people, but it's the simplest band-aid for the problem in my particular application while I continue to experiment with Grails 2.0.0.

Unhelpful warnings in the start up logs about Grails calling deprecated Grails methods

jsondow@lgmac-jsondow:~/w1/webapplications/nac/main$ grails run-app --non-interactive
| Packaging Grails application...
Config: development environment
| Packaging Grails application.....
Config: development environment
2012-01-30 11:14:25,445 [main] WARN util.GrailsUtil - [DEPRECATED] Method ApplicationHolder.setApplication(application) is deprecated and will be removed in a future version of Grails.
| Running Grails application
2012-01-30 11:14:30,124 [Thread-9] WARN util.GrailsUtil - [DEPRECATED] Method ApplicationHolder.setApplication(application) is deprecated and will be removed in a future version of Grails.
2012-01-30 11:14:35 PST EmailerService Initializing...

Warnings in the start up log are important. If my code is doing something hazardous I want to know about it and solve it. Spurious framework warnings add dangerous noise to my start-up logs, increasing the risk that people on my team will ignore important messages in the log. Grails 2.0.0 unhelpfully warns me that the framework is calling its own deprecated methods, and there appears to be nothing I can do about it.

My first approach was to change the logging level to error for the amusingly named grails.util.GrailsUtil class. However, reading the GrailsUtil source code shows that I could also suppress those warnings by passing grails.log.deprecated=false into the JVM as a system property. I'll opt for the latter solution because my company has many Grails projects that share a common set of ant targets. Suppressing these spurious start-up warnings in general seems to me like a better default. The trade-off is that if any application code directly calls one of the deprecated APIs we won't get a helpful warning. I think that's a lesser risk than polluting our start-up logs with noise that encourages developers to ignore real problems.

Solution:
Add -Dgrails.log.deprecated=false to the GRAILS_OPTS environment variable in all IntelliJ configurations and to the ant file shared by most of the company's Grails apps.

Useless log warnings about ehcache

The start up log now contains new useless warnings.

2012-02-06 16:30:03,243 [Thread-10] WARN hibernate.AbstractEhcacheRegionFactory - Couldn't find a specific ehcache configuration for cache named [org.hibernate.cache.UpdateTimestampsCache]; using defaults.
2012-02-06 16:30:03,261 [Thread-10] WARN hibernate.AbstractEhcacheRegionFactory - Couldn't find a specific ehcache configuration for cache named [org.hibernate.cache.StandardQueryCache]; using defaults.

Asgard doesn't use ehcache, domain objects, or hibernate. It mainly interacts with Amazon Web Services APIs rather than any traditional database. These crufty warnings are particularly meaningless and mysterious for me. They should not occur by default.

Solution:
Add the following to the log4j section of Config.groovy.

// Suppress otherwise unavoidable warnings in Grails start up logs
error 'net.sf.ehcache.hibernate.AbstractEhcacheRegionFactory'

Action chain usage throws NullPointerException

A NullPointerException gets thrown by hard-to-identify plumbing code when calling this method for a validation failure in a controller.
chain(action: create, model: [cmd: cmd], params: params)

Result:

| Error 2012-01-30 14:02:01,406 ["http-bio-8080"-exec-4] ERROR errors.GrailsExceptionResolver - NullPointerException occurred when processing request: [POST] /application/save - parameters:
_action_save:
cmc:
alertingServiceKey:
monitorBucketType: application
description: Testing nac
name: hellojsondow
owner: jsondow
type: Web Service
Stacktrace follows:
Message: null
Line | Method
->> 161 | doCall in com.netflix.nac.ApplicationController$_closure6
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| 886 | runTask in java.util.concurrent.ThreadPoolExecutor$Worker
| 908 | run . . in ''
^ 680 | run in java.lang.Thread

Printing out the values in the chain map shows that none of the values are null.

println create
println([cmd: cmd])
println params

com.netflix.nac.ApplicationController$_closure5@5b736d8
[cmd:com.netflix.nac.ApplicationCreateCommand@3e420e73]
[_action_save:, cmc:, alertingServiceKey:, monitorBucketType:application, description:Testing nac, name:hellojsondow, owner:jsondow, type:Web Service, action:save, controller:application]

Something is going wrong under the hood in the chain method, which has never had a problem before. Googling leads to the Grails chain documentation which does not illuminate the error, but instead indicates that action is still supposed to be what it was in Grails 1.3.7: "The action to redirect to, either a string name or a reference to an action within the current controller." More googling leads to irrelevant-looking NullPointerExceptions. Possibly similar is this report but there is no chain call there.

For a few hours this one seemed like it could be a deal breaker. It defies all attempts at traditional debugging. I resorted to stabbing wildly in the dark by ignoring the documentation and my own better judgement and trying a string to identify the action instead of a direct reference to the action closure variable. This fixed the problem, although it's inferior code, since a string will not be checked by my IDE until runtime, so it is likely to take longer to discover a typo.

While trying to debug through trial and error, it became clear that hot code replacement of controller code to the running server through IntelliJ no longer works reliably in Grails 2.0.0 on my machine. It just fails silently. Each experimental code change in IntelliJ requires a server restart. This eradicates the most significant of Grails' productivity wins for an existing application.

Solution:
Wrap all action values in single quotes. Despite what the Grails 2.0.0 documentation says for chain, the action value must be a String, not a Closure.

Hot code replacement is broken in IntelliJ for Grails 2.0.0

Hot code replacement of a controller class from IntelliJ 10 no longer works. This may be the biggest problem of all. The loss of hot code replacement from my IDE would be such a large step backwards in productivity that I would need to start looking for an alternative to Grails that can do hot code replacement for most classes in IntelliJ.

Googling the problem reveals these discussions and temporary solutions. It's a problem with Grails 2.0.0 running in IntelliJ 10 and IntelliJ 11.
http://grails.1312388.n4.nabble.com/Grails-2-0-RC1-Auto-Reloading-td4023792.html
http://dattein.com/blog/intellij-not-hot-deploying-grails-application/
It's a pretty unfortunate workaround but it gets the job done if I really want Grails 2.0.0 and IntelliJ right away.

Solution:
Add this long magic spell to the VM parameters field of all IntelliJ configurations:
-server -javaagent:/Users/jsondow/w1/Tools/groovy/grails-2.0.0/lib/com.springsource.springloaded/springloaded-core/jars/springloaded-core-1.0.2.jar -noverify -Dspringloaded=profile=grails

Native instrumentation error during start up

Now that I'm using the javaagent parameter to enable springloaded my start up logs contain about 5 to 15 lines like these:

*** java.lang.instrument ASSERTION FAILED ***: "!errorOutstanding" with message transform method call failed at ../../../src/share/instrument/JPLISAgent.c line: 806
*** java.lang.instrument ASSERTION FAILED ***: "!errorOutstanding" with message transform method call failed at ../../../src/share/instrument/JPLISAgent.c line: 806

This didn't happen in early experiments. I tried a thorough clean and rebuild on the command line but it's still happening on my machine. Googling it shows various old Java complaints about other frameworks.

If I remove the springloaded jar parameters shown above then this problem goes away, but then I can't iterate on my code using a running server anymore, rendering Grails somewhat useless for development.

Solution:
None found.

The command line and VM parameters for Grails 2.0.0 are much more numerous and complicated than for Grails 1.3.7

Solving some of the problems above introduces a new problem. Configuration management just got a lot more error prone.

My old most common IntelliJ configuration was the following. It includes a couple of optional Asgard flags I made up to reduce start up time.
Command line: run-app
VM parameters: -XX:MaxPermSize=128m -Dserver.port=8080 -DskipCacheFill=true -DonlyRegions=us-east-1,eu-west-1

To solve problems in Grails 2.0.0 that configuration needs to become the following in order to do the same thing.
Command line: run-app --non-interactive
VM parameters: -server -javaagent:/Users/jsondow/w1/Tools/groovy/grails-2.0.0/lib/com.springsource.springloaded/springloaded-core/jars/springloaded-core-1.0.2.jar -noverify -Dspringloaded=profile=grails -Dgrails.log.deprecated=false -XX:MaxPermSize=128m -Dserver.port=8080 -DskipCacheFill=true -DonlyRegions=us-east-1,eu-west-1

The reloading workaround requires a unique path to the springloaded-core jar for every developer. We have several full time and several part time developers on the project, with plans to open source the code to hundreds of developers. The need for so much start up configuration just for the framework is a step further away from the ease of use that Grails 1.3.7 provided.

Solution:
None found.

Not upgrading yet

My team agreed it was worth the exploration phase to document the problems I found so far. This will be a starter reference for us when Grails 2.0.1 or Grails 2.0.2 gets released so I can try again to have a more positive upgrade experience. The mixin bug, the IntelliJ reload bug, the action closure bug, and the reliance on a snapshot version of the Spock plugin are enough reasons for this upgrade to be more of a loss than a win for me. I'll try again with a future version of Grails.

Even if the bugs were fixed in Grails/IntelliJ there are still some design decisions with which I disagree. The need for a --non-interactive flag seems to me like a poor decision. The interactive mode in Grails 1.3.7 existed but was opt-in. Making it opt-out makes it seem like Grails is trying to cater mostly to new developers working on the command line, with less regard for long-time Grails developers working on mature applications using an IDE and automated Jenkins builds. The repetitive logs in a Jenkins build or an IDE build reinforce this feeling. Animated dots to show the passage of time for new developers on the command-line are not a good enough reason to muddy up the start up logs for other developers.

There are several tens of thousands of issues in the Grails Jira backlog. It's a bit challenging to identify which Jira tickets may relate to any of the issues I've encountered. If I find or create relevant Jira tickets I'll link to them from my blog.

Here's someone else's story about why large existing Grails apps might not want to upgrade to Grails 2.0.0: http://schneide.wordpress.com/2012/01/23/upgrading-your-app-to-grails-2-0-0-better-wait-for-2-0-1/

Please tell me why I'm wrong on Twitter @joesondow or in the comments below.

Monday, October 11, 2010

Silicon Valley Code Camp 2010

Back in May 2010 I moved from New York to Silicon Valley to be among the densest population of software developers on Earth. There might be a denser one in Alpha Centauri but their light signals haven't reached our telescopes. Yet.

One of the main reasons I wanted to be in California was for the vibrant tech community. They put on some great conferences with amazing frequency. This weekend was a massive free developer conference called Silicon Valley Code Camp.

Several thousand developers signed up for SVCC and almost half of them showed up at Foothill College in Los Altos Hills. I ran into a few friends and I met some new people. One thing that really stood out was the unusually high percentage of women at this developer conference.

There were some good talks I went to on developing and testing HTML5 and jQuery, designing user experiences, coding in Scala, and continuous deployment.

That's right, continuous deployment. Not continuous integration. Continuous deployment. Love. It's an awesome practice that Adam Rosien and Eishay Smith at kaChing are following with gusto. Adam admitted they hadn't yet solved the problem of good front-end code coverage with Selenium, mostly because they hadn't yet gotten their Selenium builds sufficiently parallelized. Been there. I suspect they'll end up using Test Swarm or Sauce OnDemand when they get around to it.

I'm getting increasingly interested in Test Swarm and QUnit, since they sound like they maybe, possibly, hopefully, kind of might be able to do what Selenium cannot yet do, which is run quickly and reliably and with minimal maintenance. Don't get me wrong, I have a huge crush on Selenium, but it's very high maintenance and it sometimes forgets my birthday. I can't help but notice QUnit and Test Swarm. After Kevin Nilson's talk it sounds like all three might work together well.

However, what I'm immediately jazzed to start using is the jQuery Templates plugin. I knew it was coming, based on the buzz on The Official jQuery Podcast and yayQuery Podcast but I didn't realize (a) it's already released, (b) how awesome it is, (c) it was one of Microsoft's contributions. Hey, I'm a regular Java developer who knows Microsoft's reputation, but they did invent Ajax. They're not all bad, especially these days. Microsoft is like Xena and Angel. They used to be the big bad, but they've started to see the light. Will they save us when Google becomes evil? But I digress. Doris Chen showed off some new jQuery features, whetting my appetite to really ajaxify the web app I'm in charge of at Netflix. Doris was an evangelist at Sun before she became an evangelist at Microsoft. This fact caused the audience to make scandalous noises like a shocking soap opera scene was unfolding.

I'll talk about Netflix another time. Or follow me on Twitter if you're curious. I don't blog much. I mostly tweet.

My good friend Dave Briccetti did a talk where he showed the code for his Scala Lift app BirdShow that he made for his mom's web site of gorgeous wildlife photos. This was one of the better Scala talks I've seen, because it focused on clean, practical usages of Scala instead of showing the most advanced features that Scala newbies can barely understand. After Dave's talk I got to hang around with him and work on some JavaScript.

Mark Miller gave a rousing talk on The Science of Great UI. I know some immediate changes I'll be making at work based on Mark's advice and examples. For instance, tables look better in Excel than Word because the lines are less important than the data so they should be lighter. I'm glad I took a lot of photos of slides, since the slides are not so easy to find for most speakers.

What was your favorite tech conference?

Sunday, October 4, 2009

Fancy Hudson Email

Over lunch I was discussing email overload with my coworker Scott. I already use automatic color-coding and move-to-folder rules to help me find the important stuff at a glance. Still there are some automated emails that I must read, but which are time-consuming to deal with.

One such email is the Hudson build result. Hudson is a fantastic tool for continuous integration, but its email format leaves a lot to be desired:

From there I have to click the link, wait for a slow loading Hudson page, then click another link to view the full test results. Then I have to sift through the tests that only failed once because of environmental quirks and Selenium bugs, to identify the recurring test failures that need prompt attention.

What I really want is something like this:

The subject line tells how many tests have failed. The recurring test failures are listed with direct links to each failure message. If there are no recurring failures then the test is marked a success. The duration is red for tests that took over a minute. If the test name and error message are short enough, the error message is displayed.

Fortunately Scott had read a blog post with a solution that might allow us to write our own email output in Groovy. I googled it and found this gem by Chetan: Using Groovy with Hudson to send rich text email

If you read Chetan's post and the comments, most problems come from trying to show recent SCM changes in Perforce, SVN, or CVS. That's not my top priority, so I'm starting with the output that I described above. Here's how I got it working:

Download Chetan's enhanced email-ext plugin for Hudson and add it to your Hudson setup at Hudson -> Manage Hudson -> Manage Plugins -> Advanced
Upgrade Hudson to the latest version (1.326 when I started the project 2 days ago, although 1.327 came out last night, because Kohsuke Kawaguchi has superpowers and releases enhancements all the time)
Disable "E-mail Notification" and enable "Editable Email Notification"
Enable "Default Content is Script" and "Default Content Type is HTML"

Set the default subject to this Groovy template:

$DEFAULT_SUBJECT <% def tr = build.testResultAction;  if (tr?.failCount) { %>(${tr?.failCount} failures ${tr?.failureDiffString}) <% } %>

Set the default content to this Groovy template, which you'll probably want to read and edit to suit your needs:

<style>
body, table, td, th, p { font-family: Verdana,Helvetica,sans-serif; font-size: 11px; }
.pane      { margin-top: 4px; white-space: nowrap; }
table.pane { border: 1px solid #BBB; border-collapse: collapse; }
td.pane    { border: 1px solid #BBB; padding: 3px 4px; vertical-align: middle; }
th         { border: 1px solid #BBB; background-color: #F0F0F0; font-weight: bold; padding: 4px; }
</style>
<%
def stillFailing = []
def rootUrl = hudson.model.Hudson.instance.rootUrl
def jobName = build.parent.name
def buildNumber = build.number
def buildUrl = "${rootUrl}job/$jobName/$buildNumber/testReport/"
if (build.testResultAction) {
    build.testResultAction.failedTests.each{tr ->
        def packageName = tr.packageName
        def simpleClassName = tr.simpleName
        def testName = tr.safeName
        def displayName = tr.className+"."+testName
        def duration = tr.durationString;
        if (duration.contains(" min")) {
            duration = """<font color="red">""" + duration + "</font>"
        }
        def url = "${rootUrl}job/$jobName/$buildNumber/testReport/$packageName/$simpleClassName/$testName"
        def error = (tr.errorDetails && tr.errorDetails.length() < 30 && displayName.length() < 100) ? tr.errorDetails : ""
        error = error.replaceAll("<", "&lt;")
        def failMap = [displayName:displayName,url:url,age:tr.age,error:error,duration:duration]
        if (tr.age > 1) {
            stillFailing << failMap
        }
    }
    stillFailing = stillFailing.sort {it.displayName}
}
def emailHeader = stillFailing.size() > 0 ? "Recurring Failed Tests" : "Success"
%>
$PROJECT_NAME - Build # $BUILD_NUMBER - $BUILD_STATUS:<br/>
<br/>
Check <a href="${buildUrl}">${buildUrl}</a> to view the results.<br/>
<h2>${emailHeader}</h2>
<table class="pane">
    <tr>
        <th>Test Name</th>
        <th>Duration</th>
        <th>Age</th>
    </tr>
    <% stillFailing.each { failedTest-> %>
        <tr>
            <td class="pane"><a href="${failedTest.url}">${failedTest.displayName}</a>&nbsp;&nbsp;${failedTest.error}</td>
            <td class="pane" style="text-align: right;">${failedTest.duration}</td>
            <td class="pane" style="text-align: right;">${failedTest.age}</td>
        </tr>
    <% } %>
</table>
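
For anyone who wants to try the filtering rules before pasting a template into Hudson, here is a minimal sketch of the same logic in plain Java. It is not part of Chetan's plugin or of Hudson: the RecurringFailures and FailedTest names are hypothetical stand-ins I made up for illustration, and the fields only mimic the properties the template reads (age, durationString, errorDetails). The thresholds (age over 1, error under 30 characters, name under 100 characters, " min" in the duration) are copied from the template above.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: mirrors the template's rules without depending on Hudson.
class RecurringFailures {

    /** Stand-in for one failed test result; NOT a Hudson class. */
    static final class FailedTest {
        final String displayName;   // className + "." + testName
        final int age;              // number of consecutive builds the test has failed
        final String duration;      // e.g. "1 min 2 sec" or "5 sec"
        final String errorDetails;  // may be null

        FailedTest(String displayName, int age, String duration, String errorDetails) {
            this.displayName = displayName;
            this.age = age;
            this.duration = duration;
            this.errorDetails = errorDetails;
        }
    }

    /** Keeps only tests that have failed more than one build in a row, sorted by name. */
    static List<FailedTest> stillFailing(List<FailedTest> failed) {
        return failed.stream()
                .filter(t -> t.age > 1)
                .sorted(Comparator.comparing((FailedTest t) -> t.displayName))
                .collect(Collectors.toList());
    }

    /** The email counts as a success when nothing is failing repeatedly. */
    static String header(List<FailedTest> recurring) {
        return recurring.isEmpty() ? "Success" : "Recurring Failed Tests";
    }

    /** Returns the escaped error only when both the name and the message are short. */
    static String shortError(FailedTest t) {
        String e = t.errorDetails;
        if (e == null || e.isEmpty() || e.length() >= 30 || t.displayName.length() >= 100) {
            return "";
        }
        return e.replace("<", "&lt;");
    }

    /** Wraps durations of a minute or more in red, as the template does. */
    static String durationHtml(FailedTest t) {
        return t.duration.contains(" min")
                ? "<font color=\"red\">" + t.duration + "</font>"
                : t.duration;
    }

    public static void main(String[] args) {
        List<FailedTest> failed = Arrays.asList(
                new FailedTest("b.CartTest.checkout", 3, "1 min 2 sec", "x < y"),
                new FailedTest("a.LoginTest.signIn", 2, "5 sec", null),
                new FailedTest("c.SearchTest.flaky", 1, "5 sec", "timeout"));

        List<FailedTest> recurring = stillFailing(failed);
        System.out.println(header(recurring));
        for (FailedTest t : recurring) {
            System.out.println(t.displayName + "  " + shortError(t) + "  "
                    + durationHtml(t) + "  age " + t.age);
        }
    }
}
```

The one-off failure (age 1) drops out of the list, which is the whole point: environmental quirks stay out of my inbox unless they repeat.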

Chetan has submitted a patch for Hudson to have this ability without hacking the email-ext plugin. Maybe someday this will be even more trivial to set up.

It's easier to edit the Groovy code in the Hudson configure page textarea if that textarea is wider and uses a monospace font. To that end, I've written a Greasemonkey script called monospace-hudson that makes those changes when the configure page loads.

Without monospace-hudson

With monospace-hudson

Tuesday, March 17, 2009

Presenting Open Spaces and Lightning Talks

Last week I was getting over a cold at work when the CEO came to talk to me. For the company meeting she wanted employees to present on the conferences they had attended recently. I was still feeling a little under the weather, but I always enjoy telling people about the Java Posse Roundup and the ideas and experiences I was exposed to there. So I gave her a synopsis about Open Space conferences and lightning talks. She liked the ideas and asked me to present them to the company. I agreed.

The meeting began and soon moved on to the employee presentations. I was tired and weak from being sick the day before, but I wanted to communicate the Open Space ideas. I didn't know how many presentations there would be or when mine would be. The first few presentations went by, with detailed slides and lots of information about online advertising trends and healthcare marketing topics. I got increasingly nervous, since I had no slides and no written outline, and I felt more like going home and napping than improvising for a large audience.

Finally the CEO introduced my talk at the end. Relieved to stop thinking and start talking, I began following Jared Richardson's advice from his Career 2.0 talk at No Fluff Just Stuff.

• For each point you make, look a different audience member in the eye.
• Pace slowly back and forth across the stage area, to keep people's eyes moving and their attention focused.
• Use big arm gestures.
• Tell jokes.
• Modulate your voice high and low at different times.

I explained how an Open Space conference works, how they've been covered by Business Week as a new way for conferences to educate like-minded people with an unconventional approach. I described the morning sessions at the Java Posse Roundup, where the attendees posted ideas for discussions on post-it notes and then met and recorded our discussions for the podcast. I talked about afternoons where we either went skiing and chatted about programming, or went to someone's house and did some coding to learn a new language. And then I talked about the evenings, where we did lightning talks.

Fortunately it was a subject I'm already passionate about so it was easy to make it interesting. One of the ideas I wanted to sell was that a session of lightning talks should include a few talks that are included only to keep people amused and interested. The point is for their attention not to wander too far and to increase the audience's retention of the material. The inclusion of the "just for fun" talks has been slightly difficult for my teammates to swallow, so I wanted to address it specifically. As one coworker put it "If I have a deadline, why would I want to go in a conference room and listen to you talk about racecars?" As I explained to my audience, the point is to inspire creative thinking and self-expression, to get people practicing energetic public speaking, and to keep everyone awake and amused so they'll be more likely to retain the information from other lightning talks. And the most important part... if you don't like a presentation, it's only 5 minutes. Just wait for the next one.

The audience laughed when I showed them the example of Andrew Harmel-Law's "How to Prepare for Zombie Attacks" lightning talk on YouTube. They applauded when I pointed out that they were laughing and would therefore probably remember some of the other points I made.

The only question from the audience was "So how DO we prepare for zombie attacks?"

Wednesday, March 11, 2009

Selenium

The first of my 5-minute lightning talks has been posted to the Java Posse YouTube channel. Thanks to Carl Quinn for the videotaping and processing.

Summary

Selenium is a web site testing tool that lets you build repeatable tests to ensure that all the important functionality of your web application still works after you've been altering your code base. The master copy of a test is saved as an HTML file, which can be edited easily in the Firefox plugin called the Selenium IDE. From there it can be saved as Java code to execute in a JUnit test, or in other languages like PHP, C#, Ruby, Perl, and Python.

The Good News

Most interesting web applications depend on JavaScript, properly named hidden form inputs, and the integration of many systems working together in real time. The best way I've seen to test an entire system in a real browser is to create a collection of Selenium tests and run them automatically as a JUnit test suite. If you habitually write and maintain Selenium tests for all new and altered functionality on your web site, and run those tests a few times a day using a continuous integration system like Hudson, you'll have a solid test bed that tells you as soon as someone on your team breaks something unexpectedly.

The Bad News

Selenium is young and finicky. It doesn't want to eat its brussels sprouts and it sometimes cries for a glass of water in the middle of the night. Because it runs in a real browser, it is subject to various meaningless error states that I've seen over the past 2 years of using it, including:

Random timeouts waiting for pages to load
Inability to use secure pop-up windows in IE
Overloaded CPU
Failure to start if Firefox is already running
A browser alert hangs the entire test suite indefinitely

Conclusion

Despite these monkey wrenches, I adore Selenium for its assurance that my web site works correctly before each release. It doesn't catch everything that a human QA team can find, but it does catch problems that a QA team doesn't have time to quadruple check every single week. You will probably need to write enough test utilities to restore the system to a known, deterministic state for each test. You'll also need to get your team in the habit of fixing a few tests for each significant code change, and learn to live with the fact that you won't always know who broke a test. Just fix it anyway. Once all that is in place, then you begin to have strong confidence that all your assumptions about your web site's functionality are still true six months later.

Sunday, March 8, 2009

Immediate Ideas from JPR09

My 28 seconds of fame has arrived. If you're not already a Java Posse podcast listener, let episode 233 be your introduction to this excellent resource, and your chance to hear my radio voice as the Java Posse Roundup 2009 attendees all take a chance with the mic.

I'm trying to recall the best ideas I learned at the roundup last week, for prompt use at my workplace, Marketing Technology Solutions. Some of the ideas about how to help get back developer time were interesting, especially the focus on automating more processes. I look forward to reviewing more of the audio sessions in future weeks and months to extract more ideas.

However, there are three ideas that made my eyes light up when I heard them, and they're relatively simple to get started at work almost immediately after selling them to the team.

1. Project Retrospectives

After each code release, allow a full regular day to fix any surprising production issues. Then on the next day, run through a 1 to 3 hour code review and functionality retrospective. During this meeting each team member who completed anything interesting can show the primary changes they made to the code base and the improved functionality they created. This can help other team members learn how areas of the code are evolving that they haven't seen recently, and it can help us learn from each other's challenges and the solutions we found.

2. Regular Discussions with Employees

Spend 30-60 minutes each week with each employee to find out privately if there are any ideas or problems that deserve some attention from the employee's perspective. This should preferably be done far away from the office building to encourage unchecked speech about office troubles. It's reasonable to use a full or partial lunch break for the meeting if the employee wants, or just go for a walk during work hours.

3. Lightning Talks

Once a month our team can have an hour-long lunch-and-learn session of 5 minute lightning talks on any technical or non-technical subjects. Keeping things educational and entertaining helps promote free thinking and creativity, so non-technical talks are just as valuable as technical ones, as long as they're interesting or funny. This is especially helpful for team members who are not yet comfortable speaking in front of a group, which is an increasingly important skill for a developer. To get things started I'll need to prepare a few short talks on technical and non-technical subjects, with and without slides. This can help others see the flexibility of the format so they feel free to present whatever and however they prefer.

The lightning talks at the Java Posse Roundup 2009 were very successful. They went as follows:

TUESDAY
"Fair Allocation" Algorithm
Selenium Intro
How I Became 3/4 of the Man I Was
The Art of the Photo
High Gear Media and GWT
Loop Quantum Gravity
Animation Inside JavaFX
Racing 101
Awesome Productivity with GMail
ScalaCheck Automated Tests
Simple Twitter Client in Scala
Your Eyes Suck at Blue

WEDNESDAY
Why Are There 12? or The Other Staircase
DB Migration in Java
Dynamic Web Skinning
Centerline Soccer
YQL
Java User Groups: How to Start One?
Sophisticated Data Access with JPA and Spring
F1 KERS System (2009)
Scala + Wicket
Surprise?
Slide Rules for Fun and Profit
Call 811
Repository Management

THURSDAY
Fan
jFlubber, FlexFlubber, FXFlubber
The Smallest Plugin System
Hacking Hardware
Helmet Cam Footage
Zombies: Are You Prepared?
JavaScript Shell
Recovering a Stolen Laptop with Flex
Doctor Who
Sexier Software with Flex
Solar Power for Your House
Scala and JavaRebel
Groovy SwingBuilder, Google Maps, YQL Mashup

The JPR09 lightning talks are gradually getting released by Carl Quinn on http://www.youtube.com/javaposse
