Monday, February 7, 2011

(Part 2) AppEngine Task Queue + Transactions behavior

In Part 1 of this story I made some observations about task execution with respect to data modified in a transaction. I have now formally isolated some test cases which reveal inconsistent data. I have deployed a demo application on App Engine and shared the source code. The code is an Eclipse project and can be checked out directly into Eclipse.

Here is a screen shot of the AppEngine Test Jig (opened on the tab for testing Transactions and Queue timing):

AppEngine Test Jig

In summary, there are two tests. Test 1 will demonstrate the ill behavior, and Test 2 will demonstrate a workaround. You can push the Test button of the appropriate test, and when you think it might be done, you can push the “Get Test Status” button, which will return the current stats measuring the test. Rather than pushing this button repeatedly to monitor progress, you can first press the “Open Feedback Channel” button to see feedback from the server in real time. (Note: this uses the Channel API. After two hours, your channel will be gone. Just reload the page.) You can push the “Echo” button to verify that the channel is working; it simply pushes an echo message from the server to your browser.

The actual test is simple. There is a state machine that may be advanced until it reaches its end state. It is implemented as a single integer in an object that is persisted to the DataStore via JDO. That integer is incremented by one as long as it is less than 30, which is considered the end state. The advancement of the state machine is performed within a transaction, and within that same transaction a task is queued that will perform the next advancement of the state machine. This two-step pattern (modify state, then queue task) makes a simple atomic unit of operation. These atomic units can be chained together to drive the state machine to its final state.
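Sketched in plain Java, the chained atomic unit looks something like the following. This is an in-memory stand-in, not the demo app's code: the real app does both steps inside one JDO transaction and enqueues via the Task Queue API, and all names here are illustrative.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// A minimal in-memory model of the test's atomic unit: advance the
// state, then queue a task that performs the next advancement. The
// real app does both inside one JDO transaction and enqueues via the
// Task Queue API; the names here are illustrative stand-ins.
public class StateMachineModel {
    static final int END_STATE = 30;

    int state = 0;                        // the persisted integer
    int invocations = 0;                  // times the advancement ran
    final Deque<Runnable> queue = new ArrayDeque<>(); // stand-in task queue

    void advance() {
        invocations++;
        if (state < END_STATE) {
            state++;                      // step 1: modify state
            if (state < END_STATE) {
                queue.add(this::advance); // step 2: queue the next unit
            }
        }
    }

    void run() {
        advance();                        // kick off the chain
        while (!queue.isEmpty()) {
            queue.poll().run();           // drain the "task queue"
        }
    }

    public static void main(String[] args) {
        StateMachineModel m = new StateMachineModel();
        m.run();
        System.out.println(m.state + " reached in " + m.invocations + " invocations");
        // prints: 30 reached in 30 invocations
    }
}
```

With a perfectly consistent store the chain needs exactly 30 invocations to reach state 30; that is the baseline the test's "invocations" counter is measured against.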

What are we testing for? The hypothesis is that although we successfully commit our transaction, the data is not yet written, and the queued task can act on old state data. (Bad!)


Perform Test 1 (Will demonstrate the “bad” behavior.)

If you have not already done so, click “Open Feedback Channel”. This will provide real time feedback of state changes.

Press “Start Test 1”.

When it completes, press “Get Test Status”.

You should observe “mismatches” occurring. If not, you just got lucky; simply clear your console and run Test 1 again. You will see mismatches occur eventually. Here’s a sample output:


So what does this “mismatch” mean? How do you interpret this output?

Well, recall the hypothesis we are testing. During the transaction, we incremented the state by one and queued a task to do the next advancement of state. The hypothesis is that it is possible for that queued task to execute *before* the incremented data is actually written to disk, and therefore the queued task reads the incorrect (previous) state value.

We can prove it. Because we know what the next state is *supposed* to be, we simply pass that as a parameter of the queued task. When we execute the task, this “expectedState” parameter is compared to what is read from the data store, and if they don’t match, then we log and send a “mismatch” message so we can tally them. Above, we see 6 mismatches in the status area at the top of the page.

Another way of measuring the same thing is simply counting the number of invocations made to advance the state. This is the “invocations” number in the status area at the top of the page. It should be no surprise that it took 36 invocations to get our state machine to 30: we know that 6 of those times, it started from the incorrect (previous) state and the advancement had to be redone.
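The relationship between those two numbers can be illustrated with a small deterministic model in plain Java. This is a sketch, not the demo app's code: here every 5th commit is artificially "slow", so the task it queued reads the previous value, tallies a mismatch, and redoes the increment. All names are illustrative stand-ins for the JDO/Task Queue code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Deterministic model of Test 1's failure mode. A "slow" commit is one
// whose write is not yet visible when the task it queued runs, so that
// task reads the previous value. Illustrative stand-in, not App Engine API.
public class Test1Model {
    static final int END = 30;

    int durable = 0;        // the value actually committed
    int visibleLag = 0;     // >0: the next read still sees durable - 1
    int commits = 0;
    int invocations = 0, mismatches = 0;
    final Deque<Integer> queue = new ArrayDeque<>(); // expectedState per task

    int read() {
        if (visibleLag > 0) { visibleLag--; return durable - 1; } // stale read
        return durable;
    }

    void commit(int value) {
        durable = value;
        if (++commits % 5 == 0) visibleLag = 1; // every 5th commit is "slow"
    }

    void task(int expected) {
        invocations++;
        int v = read();
        if (v != expected) mismatches++;    // read old state: tally it
        int next = v + 1;
        commit(next);                       // redo (or do) the increment
        if (next < END) queue.add(next);    // next task expects to read "next"
    }

    void run() {
        commit(1);                          // first transaction writes 1 ...
        queue.add(1);                       // ... and queues a task expecting 1
        while (!queue.isEmpty()) task(queue.poll());
    }

    public static void main(String[] args) {
        Test1Model m = new Test1Model();
        m.run();
        System.out.println("state " + m.durable + ", invocations " + m.invocations
                + ", mismatches " + m.mismatches);
        // prints: state 30, invocations 36, mismatches 7
    }
}
```

The exact counts depend on how often a commit lags, but the invariant is the point: every stale read costs exactly one extra invocation, so invocations is always the number of real advancements plus the number of mismatches, just as in the status output above.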

The one other number above is the “retries” count. If we are going to count the number of times a queued task is invoked, then we need to be mindful that a task execution *can* fail, in which case it is retried. This should not normally occur, but I had to track it to be sure retries were not polluting our numbers.


Perform Test 2 (Will demonstrate a workaround)

If you have not already done so, click “Open Feedback Channel”. This will provide real time feedback of state changes.

Press “Start Test 2”. It will take longer as each state advancement will take about two seconds.

When it completes, press “Get Test Status”.

You should have no mismatches, and your output should look something like this:


Everything is clean. But I consider this a “workaround”. I would rather it simply behave the way you’d expect: a task queued within a transaction should act on data consistent with what that transaction committed.


The Workaround

Test 2 runs exactly the same code, with two changes. First, the jdoconfig.xml for Test 2 uses an 1800 ms timeout on datastore writes. Second, all tasks queued to do state advancement are submitted with a parameter such that they are scheduled to run two seconds in the future.

This workaround appears to work well under normal conditions. By simply waiting two seconds, the data gets written, and when the task finally executes, it has the correct state available. I would *like* to believe that if the write fails to complete within 1800 ms, my commit would fail and the task would never get queued. However, I fear that if that write fails, my commit is probably long gone and the task was already queued anyway. Such is the nature of a distributed DataStore, I suppose – we can’t really wait around for consistency at our commit.
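The effect of the two-second delay can be illustrated with a small deterministic model in plain Java: if every task waits long enough for the pending write to become visible before it reads, the stale reads disappear. (In the real app the delay would presumably be set when enqueuing, e.g. via TaskOptions' countdownMillis; the code below is an illustrative in-memory sketch, not App Engine API.)

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Model of the Test 2 workaround: each queued task is delayed long
// enough that the previous commit has become visible by the time the
// task reads. Names are illustrative stand-ins, not App Engine APIs.
public class Test2Model {
    static final int END = 30;

    int durable = 0;        // the value actually committed
    int visibleLag = 0;     // >0: an immediate read would still see durable - 1
    int commits = 0;
    int invocations = 0, mismatches = 0;
    final Deque<Integer> queue = new ArrayDeque<>(); // expectedState per task

    void commit(int value) {
        durable = value;
        if (++commits % 5 == 0) visibleLag = 1; // every 5th commit is "slow"
    }

    void task(int expected) {
        invocations++;
        visibleLag = 0;                 // the ~2 s countdown outlasts the lag,
        int v = durable;                // so the read is never stale
        if (v != expected) mismatches++;
        int next = v + 1;
        commit(next);
        if (next < END) queue.add(next);
    }

    void run() {
        commit(1);                      // first transaction writes 1 ...
        queue.add(1);                   // ... and queues a task expecting 1
        while (!queue.isEmpty()) task(queue.poll());
    }

    public static void main(String[] args) {
        Test2Model m = new Test2Model();
        m.run();
        System.out.println("state " + m.durable + ", invocations " + m.invocations
                + ", mismatches " + m.mismatches);
        // prints: state 30, invocations 29, mismatches 0
    }
}
```

No mismatches and no extra invocations – the clean result Test 2 shows. But note that this only holds because the fixed delay happens to exceed the visibility lag, which is exactly why it is a workaround rather than a guarantee.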

Here is the jdoconfig.xml entry for test 2:

<persistence-manager-factory name="transactions-manual-test2">
	<property name="javax.jdo.PersistenceManagerFactoryClass"
		value="" />
	<property name="javax.jdo.option.ConnectionURL" value="appengine" />
	<property name="javax.jdo.option.NontransactionalRead" value="false" />
	<property name="javax.jdo.option.NontransactionalWrite" value="false" />
	<property name="javax.jdo.option.RetainValues" value="true" />
	<property name="javax.jdo.option.DatastoreWriteTimeoutMillis" value="1800" />
	<property name="datanucleus.appengine.autoCreateDatastoreTxns" value="false" />
</persistence-manager-factory>


The Conclusion?

So far I have just isolated some behavior, but I still lack an understanding of it. I need a Google App Engine person to weigh in on how data consistency is expected to behave after you successfully commit a transaction (using the Master/Slave datastore). I would also like to hear their take on this programming pattern – modify data state, then queue a task to act on that state. I feel this pattern is fundamental and must have a dependable solution available in order to use App Engine for anything.

I am also beginning to question the role of JDO in all this behavior. My next step would be to add another test case that advances the state without using JDO at all – just raw DataStore calls and transactions. That may be a Part 3 posting yet to come.


  1. There is an article on the transaction isolation of App Engine here:

    This implies that a queued task, which is going to *execute* outside the transaction, can at best achieve the "Read Committed" level of isolation.

    There could still be something funky going on at the JDO layer too.

  2. This is pretty old so you have probably figured it out already, but for any readers who may stumble across this, the answer is that the High Replication Datastore only promises 'Eventual Consistency'. Here is a quote from the Developer documentation:

    "Google App Engine's High Replication Datastore (HRD) provides high availability for your reads and writes by storing data synchronously in multiple data centers. However, the delay from the time a write is committed until it becomes visible in all data centers means that queries across multiple entity groups (non-ancestor queries) can only guarantee eventually consistent results. Consequently, the results of such queries may sometimes fail to reflect recent changes to the underlying data."