Tuesday, February 7, 2012

Ye Olde Tragic Journey of Attempting to Upgrade to Grails 2.0.0

It appears that Grails 2.0.0 is not yet ready for the large, pre-existing Grails 1.3.7 application my team works on at Netflix.

First a little background. I work on Asgard, formerly known as the Netflix Application Console (NAC). Here's a slide deck and a video about it. Asgard is a Grails-based web app used internally by Netflix to manage cloud systems and deployments in the Amazon Web Services cloud. If everything goes as planned I will be open sourcing Asgard under the Apache license on Netflix's Github space later in 2012. The application has been under constant development and in general use within Netflix since early 2010. Any time Asgard has a major problem, engineers at Netflix cannot deploy their changes to production and cannot run experiments in our test environments. If you've streamed a Netflix video in the past year, the servers that show you the user interface components, grant you access to stream the video, and store your viewing history and ratings were all deployed and upgraded repeatedly using Asgard. (The video file itself comes from a Content Delivery Network (CDN), but let's not get into that.)

I want to upgrade Grails. Here are a few reasons. The new error page shows the code where the error occurred. Taglibs can use GSPs instead of Groovy strings for templating. The build system provides easy overriding of the ivy repository location, and provides hooks to remove Asgard's dependence from the Netflix build system so I can open source Asgard. Testing annotations instead of inheritance makes for better Spock tests. Plugins can be packaged as jars and retrieved from an Artifactory repository instead of a directory filled with files. More classes are supposed to be redeployed to a running server during development iterations. It all sounds great.

I strive for the cleanest user experience I can achieve. That includes the visual and REST API of the application for users, as well as the quality of the code for my fellow developers, and the email messages and logs from the server. Let me say that last part again. The logs from the server. The logs are a crucial part of the interface between my team and the Grails framework. If the signal-to-noise ratio in the logs is too low, then my team becomes dangerously trained to ignore warnings and errors in their own IDE. I consider that risk to be well worth avoiding.

In trying to upgrade Grails 2.0.0 I encountered a number of problems. For each problem I spent time investigating the details and looking for solutions. This blog entry shows the issues that I experienced. I've divided them into a few good changes that help me find mistakes in my application, some annoying problems that are endurable for the short term, and significant regressions that make Grails 2.0.0 substantially worse for me than Grails 1.3.7.

Good

Compilation errors in GSPs

Solution:
Remove copy-pasted code that probably never would have worked anyway.

TLD start up messages are gone from log

Grails 1.3.7 always wrote this in the start up log on my local machine and I was never able to configure the logging early enough in the start up process to eliminate these messages.

2012-02-07 10:25:39,605 [main] INFO  digester.Digester  - TLD skipped. URI: http://www.springframework.org/tags is already defined
2012-02-07 10:25:39,629 [main] INFO  digester.Digester  - TLD skipped. URI: http://java.sun.com/jsp/jstl/core is already defined
2012-02-07 10:25:39,648 [main] INFO  digester.Digester  - TLD skipped. URI: http://java.sun.com/jsp/jstl/fmt is already defined

Grails 2.0.0 no longer shows those useless messages, so that's good.

Annoying

Subclassing AbstractList causes error

During command line unit test execution, and during application runtime this error occurs on a class that extends AbstractList.

[exec] | Error Compilation error compiling [unit] tests: (class: com/netflix/nac/push/Cluster, method: super$1$clearErrors signature: ()V) Illegal use of nonvirtual function call (Use --stacktrace to see the full trace)

I tried running unit tests in the IDE in an attempt to debug the issue, but that resulted in unit tests hanging without easily uncovered explanations, so I'll skip over the detailed reasons why this is happening and just accept that I can't subclass AbstractList. It wasn't strictly necessary for my use case. It was just a nice-to-have. Not being able to subclass AbstractList in a Groovy class is a bummer and probably a regression but not a deal breaker for me.

Solution:
Don't subclass AbstractList anymore. This might be a deal breaker for other applications.

Spock plugin for Grails 2.0.0 has no stable release yet

http://grails.org/plugin/spock shows that Spock tests in Grails 2.0.0 require a SNAPSHOT version of the Spock plugin. This means that each time my project builds it could get different Spock framework code. It also means that the maintainers of Spock do not yet regard the new plugin as being fully ready for release. My project has enough Spock unit tests that this is a pretty serious trade off.

Solution:
None found.

Immutable annotation errors

I knew this one from online discussions and conferences. Grails 1.3.7 uses an older Groovy version where the groovy.lang.Immutable annotation is the only Immutable annotation available, and it's deprecated because it's a bit buggy. Grails 2.0.0 uses a later version of Groovy where the default groovy.lang.Immutable is deprecated and the improved annotation is groovy.transform.Immutable. Unfortunately there is no trivial way to make the deprecated Immutable be non-default or illegal, so everyone on the team needs to be aware of this risk for every new Immutable class. Maybe my team will eventually make a CodeNarc rule to outlaw references to the default Immutable class.

Solution:
Add this line to every source file that mentions @Immutable

import groovy.transform.Immutable

Checking for equality between a null Integer to an integer literal now throws an exception instead of returning false

Integer responseCode = checkResponse()
if (responseCode == 200) { // If responseCode is null this throws an exception in Groovy 1.8.4 in Grails 2.0.0
    proceed()
} else {
    flash.message = "Error: the response code was ${responseCode}"
}

This other blog suggests that this might be a problem with checking for equality with null values for multiple cases other than Integer. If that turns out to be the case for anything in my large application then I will have to consider this a regression and definitely not a deliberate change to Groovy. I thought maybe this was a side effect of Groovy 1.8.4's increased compiler strictness but the other blog suggests that it might just be a bug in either Groovy 1.8.4 or Grails 2.0.0. This might be a show stopper, since I often need to compare values to null variables.

Solution:
Test every case in the application and replace all the Integer variables with int variables. Refactor all cases where null used to be a useful sentinel value, and come up with a different sentinel value like -1, but write more code to check for the sentinel value and convert it to a better string like "null" or "missing" or "failed" for the resulting error strings.

int responseCode = checkResponse()
if (responseCode == 200) {
    proceed()
} else {
    flash.message = "Error: the response code was ${responseCode == -1 ? 'null' : responseCode }"
}

This isn't a good solution but it might be acceptable.

Interactive mode is the default

This may be nice when building new apps on the command line, but it's mostly a productivity hurdle when I have automated builds that need an extra start up parameter, and a dozen IDE start up configuration buttons on each developer machine, all of which need a new start up parameter. Adding a start up parameter might sound like a little thing, but actually this upgrade has required 6 new start up parameters so far just to make things work like they used to. See the end of this blog post for details. That's a lot of new cruft to manage for all of Netflix's Grails developers and builds.

--non-interactive
-server
-javaagent:/Users/jsondow/w1/Tools/groovy/grails-2.0.0/lib/com.springsource.springloaded/springloaded-core/jars/springloaded-core-1.0.2.jar
-noverify
-Dspringloaded=profile=grails
-Dgrails.log.deprecated=false

Solution:
Repetitively add the --non-interactive command line parameter to all IntelliJ configurations and all ant targets

Repeated statements in start up log

The start-up log in IntelliJ is now needlessly repetitive, hurting the signal-to-noise ratio of the log. This impacts my ability to know at a glance if a legitimate new problem has occurred.
Four distinct statements now take sixteen lines each time I start my app from IntelliJ which I sometimes need to do many times a day. Adding animated dots to a log string to show the passage of time for newbie command line users should not be worth the trade off of adding garbage to the start up log for long-time developers of mature applications.

| Configuring classpath
| Configuring classpath.
| Environment set to development
| Environment set to development.
| Environment set to development..
| Environment set to development...
| Environment set to development....
| Environment set to development.....
| Packaging Grails application
| Packaging Grails application.
| Packaging Grails application..
| Packaging Grails application...
Config: development environment
| Packaging Grails application....
| Packaging Grails application.....
Config: development environment

A little research shows the naive solution to be adding the -plain-output command line parameter to the command line string (NOT to the VM parameters string) in the IntelliJ config. That makes the logs shorter and more reasonable. However, doing this breaks IntelliJ's ability to launch a browser with the localhost:8080 home page when ready. I've set up my home page to trigger the loading of all other asynchronous cache loading service initializations, so starting up this way would require a click-wait-copy-paste-browse-wait manual procedure for each server restart in development mode. This is even worse than a noisy log, so this solution doesn't work for me.

Solution:
None found.

Regressions

printf now adds extra newline characters incorrectly between string tokens and variable tokens


printf(" Cached %5d '%s'\n", map.size(), name)

That's my one use of printf. It's great for keeping my cache loading log messages vertically lined up. The output used to look like this:

 Cached   587 'us-east-1 Security Groups'
 Cached    32 'us-east-1 DB Snapshots'
 Cached    38 'us-east-1 DB Instances'


Now that same line of code results in this output:



 Cached 
  587
 '
us-east-1 Security Groups
'
 Cached 
   32
 '
us-east-1 DB Snapshots
'
 Cached 
   38
 '
us-east-1 DB Instances
'




Solution:
Don't use printf anymore. Use String.format() separately for each variable that needs padding. Stick the result into a GString. This bug is not a deal breaker for me but it might be for some applications that use printf more extensively.

println(" Cached ${String.format('%5d', map.size())} '${name}'")

Using a mixin causes infinite recursive calls

Infinite recursive call from using a mixin on a Java library class, resulting in a StackOverflowException each time the server starts.

| Error 2012-02-06 16:37:27,155 ["http-bio-8080"-exec-10] ERROR errors.GrailsExceptionResolver  - StackOverflowError occurred when processing request: [GET] /us-east-1/autoScaling/list
Stacktrace follows:
Message: Executing action [list] of controller [com.netflix.nac.AutoScalingController]  caused exception: Runtime error executing action
   Line | Method
->> 886 | runTask in java.util.concurrent.ThreadPoolExecutor$Worker
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
|   908 | run     in     ''
^   680 | run . . in java.lang.Thread

Caused by ControllerExecutionException: Runtime error executing action
->> 886 | runTask in java.util.concurrent.ThreadPoolExecutor$Worker
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
|   908 | run     in     ''
^   680 | run . . in java.lang.Thread

Caused by InvocationTargetException: null
->> 886 | runTask in java.util.concurrent.ThreadPoolExecutor$Worker
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
|   908 | run     in     ''
^   680 | run . . in java.lang.Thread

Caused by StackOverflowError: null
->>  60 | get     in org.codehaus.groovy.util.AbstractConcurrentMap$Segment
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
^    30 | get     in org.codehaus.groovy.util.AbstractConcurrentMap

org.codehaus.groovy.grails.web.errors.GrailsWrappedRuntimeException
Caused by: org.codehaus.groovy.grails.web.servlet.mvc.exceptions.ControllerExecutionException: Executing action [list] of controller [com.netflix.nac.AutoScalingController]  caused exception: Runtime error executing action
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:680)
Caused by: org.codehaus.groovy.grails.web.servlet.mvc.exceptions.ControllerExecutionException: Runtime error executing action
... 3 more
Caused by: java.lang.reflect.InvocationTargetException
... 3 more
Caused by: java.lang.StackOverflowError
at org.codehaus.groovy.util.AbstractConcurrentMap$Segment.get(AbstractConcurrentMap.java:60)
at org.codehaus.groovy.util.AbstractConcurrentMap.get(AbstractConcurrentMap.java:30)

The error message is entirely unhelpful. Debugging eventually pointed to a mixin class with no obvious problems my team could discern.

Solution:
For now, remove all use of mixins and go back to a clunkier metaclass syntax for monkey patching classes. This is a large step backward. Maybe there is something else going on, but this would fix the problem if I chose to proceed with the upgrade right now.

Out of memory exception during build

I habitually avoid using any plugins I can live without for the sake of easier upgrades and reduced detective work in the face of memory problems and mysterious errors. However, other contributors to the application have added some plugins they wanted to try, although the use of the plugin never got finished enough to justify its continued presence.

Solution:
Delete all plugins and references to them for now. When I need to add some back I'll increase the memory allocated to the build process. This isn't a suggestion for other people, but it's the simplest band-aid for the problem in my particular application while I continue to experiment with Grails 2.0.0.

Unhelpful warnings in the start up logs about Grails calling deprecated Grails methods

jsondow@lgmac-jsondow:~/w1/webapplications/nac/main$ grails run-app --non-interactive
| Packaging Grails application...
Config: development environment
| Packaging Grails application.....
Config: development environment
2012-01-30 11:14:25,445 [main] WARN  util.GrailsUtil  - [DEPRECATED] Method ApplicationHolder.setApplication(application) is deprecated and will be removed in a future version of Grails.
| Running Grails application
2012-01-30 11:14:30,124 [Thread-9] WARN  util.GrailsUtil  - [DEPRECATED] Method ApplicationHolder.setApplication(application) is deprecated and will be removed in a future version of Grails.
2012-01-30 11:14:35 PST EmailerService Initializing...

Warnings in the start up log are important. If my code is doing something hazardous I want to know about it and solve it. Spurious framework warnings add dangerous noise to my start-up logs, increasing the risk that people on my team will ignore important messages in the log. Grails 2.0.0 unhelpfully warns me that the framework is calling its own deprecated methods, and there appears to be nothing I can do about it.

My first approach was to change the logging level to error for the amusingly named grails.util.GrailsUtil class. However, reading the GrailsUtil source code shows that I could also suppress those warnings by passing grails.log.deprecated=false into the JVM as a system property. I'll opt for the latter solution because my company has many Grails projects that share a common set of ant targets. Suppressing these spurious start-up warnings in general seems to me like a better default. The trade-off is that if any application code directly calls one of the deprecated APIs we won't get a helpful warning. I think that's a lesser risk than polluting our start-up logs with noise that encourages developers to ignore real problems. 

Solution:
Add -Dgrails.log.deprecated=false to the GRAILS_OPTS environment variable in all IntelliJ configurations and to the ant file shared by most of the company's Grails apps.

Useless log warnings about ehcache

The start up log now contains new useless warnings.

2012-02-06 16:30:03,243 [Thread-10] WARN  hibernate.AbstractEhcacheRegionFactory  - Couldn't find a specific ehcache configuration for cache named [org.hibernate.cache.UpdateTimestampsCache]; using defaults.
2012-02-06 16:30:03,261 [Thread-10] WARN  hibernate.AbstractEhcacheRegionFactory  - Couldn't find a specific ehcache configuration for cache named [org.hibernate.cache.StandardQueryCache]; using defaults.

Asgard doesn't use ehcache, domain objects, or hibernate. It mainly interacts with Amazon Web Services APIs rather than any traditional database. These crufty warnings are particularly meaningless and mysterious for me. They should not occur by default.

Solution:
Add the following to the log4j section of Config.groovy.

// Suppress otherwise unavoidable warnings in Grails start up logs
error 'net.sf.ehcache.hibernate.AbstractEhcacheRegionFactory'

Action chain usage throws NullPointerException

A NullPointerException gets thrown by hard-to-identify plumbing code when calling this method for a validation failure in a controller.
chain(action: create, model: [cmd: cmd], params: params)


Result:

| Error 2012-01-30 14:02:01,406 ["http-bio-8080"-exec-4] ERROR errors.GrailsExceptionResolver  - NullPointerException occurred when processing request: [POST] /application/save - parameters:
_action_save: 
cmc: 
alertingServiceKey: 
monitorBucketType: application
description: Testing nac
name: hellojsondow
owner: jsondow
type: Web Service
Stacktrace follows:
Message: null
   Line | Method
->> 161 | doCall  in com.netflix.nac.ApplicationController$_closure6
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
|   886 | runTask in java.util.concurrent.ThreadPoolExecutor$Worker
|   908 | run . . in     ''
^   680 | run     in java.lang.Thread

Printing out the values in the chain map shows that none of the values are null.

println create
println([cmd: cmd])
println params

com.netflix.nac.ApplicationController$_closure5@5b736d8
[cmd:com.netflix.nac.ApplicationCreateCommand@3e420e73]
[_action_save:, cmc:, alertingServiceKey:, monitorBucketType:application, description:Testing nac, name:hellojsondow, owner:jsondow, type:Web Service, action:save, controller:application]

Something is going wrong under the hood in the chain method, which has never had a problem before. Googling leads to the Grails chain documentation which does not illuminate the error, but instead indicates that action is still supposed to be what it was in Grails 1.3.7: "The action to redirect to, either a string name or a reference to an action within the current controller." More googling leads to irrelevant-looking NullPointerExceptions. Possibly similar is this report but there is no chain call there.

For a few hours this one seemed like it could be a deal breaker. It defies all attempts at traditional debugging. I resorted to stabbing wildly in the dark by ignoring the documentation and my own better judgement and trying a string to identify the action instead of a direct reference to the action closure variable. This fixed the problem, although it's inferior code, since a string will not be checked by my IDE until runtime, so it is likely to take longer to discover a typo.

While trying to debug through trial and error, it became clear that hot code replacement of controller code to the running server through IntelliJ no longer works reliably in Grails 2.0.0 on my machine. It just fails silently. Each experimental code change in IntelliJ requires a server restart. This eradicates the most significant of Grails' productivity wins for an existing application.

Solution:
Wrap all action values in single quotes. Despite what the Grails 2.0.0 documentation says for chain, the action value must be a String, not a Closure.

Hot code replacement is broken in IntelliJ for Grails 2.0.0

Hot code replacement of a controller class from IntelliJ 10 no longer works. This may be the biggest problem of all. The loss of hot code replacement from my IDE would be such a large step backwards in productivity that I would need to start looking for an alternative to Grails that can do hot code replacement for most classes in IntelliJ.

Googling the problem reveals these discussions and temporary solutions. It's a problem with Grails 2.0.0 running in IntelliJ 10 and IntelliJ 11.
http://grails.1312388.n4.nabble.com/Grails-2-0-RC1-Auto-Reloading-td4023792.html
http://dattein.com/blog/intellij-not-hot-deploying-grails-application/
It's a pretty unfortunate workaround but it gets the job done if I really want Grails 2.0.0 and IntelliJ right away.

Solution:
Add this long magic spell to the VM parameters field of all IntelliJ configurations:
-server -javaagent:/Users/jsondow/w1/Tools/groovy/grails-2.0.0/lib/com.springsource.springloaded/springloaded-core/jars/springloaded-core-1.0.2.jar -noverify -Dspringloaded=profile=grails

Native instrumentation error during start up

Now that I'm using the javaagent parameter to enable springloaded my start up logs contain about 5 to 15 lines like these:

*** java.lang.instrument ASSERTION FAILED ***: "!errorOutstanding" with message transform method call failed at ../../../src/share/instrument/JPLISAgent.c line: 806
*** java.lang.instrument ASSERTION FAILED ***: "!errorOutstanding" with message transform method call failed at ../../../src/share/instrument/JPLISAgent.c line: 806

This didn't happen in early experiments. I tried a thorough clean and rebuild on the command line but it's still happening on my machine. Googling it shows various old Java complaints about other frameworks.

If I remove the springloaded jar parameters shown above then this problem goes away, but then I can't iterate on my code using a running server anymore, rendering Grails somewhat useless for development.

Solution:
None found.

The command line and VM parameters for Grails 2.0.0 are much more numerous and complicated than for Grails 1.3.7

Solving some of the problems above introduces a new problem. Configuration management just got a lot more error prone.

My old most common IntelliJ configuration was the following. It includes a couple of optional Asgard flags I made up to reduce start up time.
Command line: run-app
VM parameters: -XX:MaxPermSize=128m -Dserver.port=8080 -DskipCacheFill=true -DonlyRegions=us-east-1,eu-west-1

To solve problems in Grails 2.0.0 that configuration needs to become the following in order to do the same thing.
Command line: run-app --non-interactive
VM parameters: -server -javaagent:/Users/jsondow/w1/Tools/groovy/grails-2.0.0/lib/com.springsource.springloaded/springloaded-core/jars/springloaded-core-1.0.2.jar -noverify -Dspringloaded=profile=grails -Dgrails.log.deprecated=false -XX:MaxPermSize=128m -Dserver.port=8080 -DskipCacheFill=true -DonlyRegions=us-east-1,eu-west-1

The reloading workaround requires a unique path to the springloaded-core jar for every developer. We have several full time and several part time developers on the project, with plans to open source the code to hundreds of develoeprs. The need for so much start up configuration just for the framework is a step further away from the ease of use that Grails 1.3.7 provided.

Solution:
None found.

Not upgrading yet

My team agreed it was worth the exploration phase to document the problems I found so far. This will be a starter reference for us when Grails 2.0.1 or Grails 2.0.2 get released so I can try again to have a more positive upgrade experience. The mixin bug, the IntelliJ reload bug, the action closure bug, and the reliance on a snapshot version of the spock plugin are enough reasons for this upgrade to be more of a loss than a win for me. I'll try again with a future version of Grails.

Even if the bugs were fixed in Grails/IntelliJ there are still some design decisions with which I disagree. The need for a --non-interactive flag seems to me like a poor decision. The interactive mode in Grails 1.3.7 existed but was opt-in. Making it opt-out makes it seem like Grails is trying to cater mostly to new developers working on the command line, with less regard for long-time Grails developers working on mature applications using an IDE and automated Jenkins builds. The repetitive logs in a Jenkins build or an IDE build reinforce this feeling. Animated dots to show the passage of time for new developers on the command-line are not a good enough reason to muddy up the start up logs for other developers.

There are several tens of thousands of issues in the Grails Jira backlog. It's a bit challenging to identify which Jira tickets may relate to any of the issues I've encountered. If I find or create relevant Jira tickets I'll link to them from my blog.

Here's someone else's story about why large existing Grails apps might not want to upgrade to Grails 2.0.0: http://schneide.wordpress.com/2012/01/23/upgrading-your-app-to-grails-2-0-0-better-wait-for-2-0-1/

Please tell me why I'm wrong on Twitter @joesondow or in the comments below.

3 comments:

Graeme Rocher said...

Please make sure you create JIRA issues for some of the problems you are having.

Obviously problems like Spock and Intellij are not fixable by Grails itself (require updates to the plugin and IDE), however the NPE for integer and action chaining most certainly sound like valid problems and should be entered in JIRA.

The more reports we get the more likely you will be able to upgrade to 2.0.1 or 2.0.2

Thanks
Grraeme

Unknown said...

Impressively well documented account. Definitely enough to make me hold off on upgrading, even if I didn't already have the rule that open source X.0.0 versions are dead to me.

Jason Gritman said...

I've written up a bug for action chaining after reproducing it in a test application:

http://jira.grails.org/browse/GRAILS-8868

I was unable to reproduced the NPE for Integer check bug in Grails 2.0.1

Per the Spock team, their plugin will be final by the end of the week:
http://groups.google.com/group/spockframework/browse_thread/thread/e413dac3d3db8fc1

Followers