Monday, February 7, 2011

(Part 3) AppEngine Task Queue + Transactions behavior

In part 2, I showed a demo app that isolated some “bad” behavior when using a transaction to persist data to the datastore and queue a task to act on that data. One of my conclusions at the end of that exercise was to scrutinize JDO in another test. I have now done this.

I created a third test case where I access the raw datastore without using JDO. My initial results show that using the raw datastore API for transactions and queued tasks works as you would expect: the queued tasks advance the state correctly without ever reading “old” state values.
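Here is a hedged sketch of the low-level pattern Test 3 exercises. The names are illustrative (the “/advance” URL, the “state” property, the 30 end state), not the demo project’s actual identifiers. The key detail is that passing the Transaction to Queue.add() makes the enqueue transactional: the task is only added if the commit succeeds.

```java
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.Transaction;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

// Sketch: advance the state and enqueue the next task in one transaction,
// using the raw datastore API instead of JDO.
public class RawAdvance {
    public void advance(Key stateKey) throws EntityNotFoundException {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        Transaction txn = ds.beginTransaction();
        try {
            Entity e = ds.get(txn, stateKey);
            long state = (Long) e.getProperty("state");
            if (state < 30) {
                e.setProperty("state", state + 1);
                ds.put(txn, e);
                // Transactional enqueue: the task is added only if commit succeeds.
                QueueFactory.getDefaultQueue().add(txn,
                    TaskOptions.Builder.withUrl("/advance")
                        .param("expectedState", Long.toString(state + 1)));
            }
            txn.commit();
        } finally {
            if (txn.isActive()) {
                txn.rollback();
            }
        }
    }
}
```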


To run this test, open the feedback channel, then press the “Start Test 3 (sans JDO)” button. You should see the correct states arrive with no mismatches. I did experience “retries” caused by contention on the datastore, but even then the behavior was correct. (That contention also suggests how much quicker the raw datastore API is.) Be sure to get the final results by pressing the “Get Test Status” button. The client tallies invocations as their messages arrive, so more messages can arrive than are atomically counted on the server. (You can measure everything accurately by analyzing the log files, but tallying things nicely for a demo GUI is a bear!)

Also, while writing these “low level” versions of the code, I discovered that the JDO transactions I had written previously actually allow you to do another thing you shouldn’t, and quietly let you shoot yourself in the foot! I had a JDO transaction that initialized two objects with zeros and persisted both to the DataStore, but they did not share a common parent! With the raw API, you get slapped on the wrist at runtime: the two objects must belong to the same entity group to participate in the same transaction. Duh, I know this from the docs, yet that little thing slipped by while writing the JDO version of the code.
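The wrist slap can be reproduced with a few lines of the raw API. This is a sketch (the “Counter” kind name is illustrative, and the exact exception message varies by SDK version), but the second put of a root entity in a different entity group fails at runtime:

```java
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Transaction;

// Sketch: two root entities (no common parent) in one transaction.
public class CrossGroupDemo {
    public static void main(String[] args) {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        Transaction txn = ds.beginTransaction();
        try {
            Entity a = new Entity("Counter");  // a root entity: its own entity group
            Entity b = new Entity("Counter");  // another root entity: a different group
            ds.put(txn, a);
            ds.put(txn, b);  // throws IllegalArgumentException: the entities
                             // belong to different entity groups
            txn.commit();
        } finally {
            if (txn.isActive()) {
                txn.rollback();
            }
        }
    }
}
```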

This seems broken to me. I should get a runtime error in the JDO version of the code too.

Hmmm, I think I need to check out Slim3…

(Part 2) AppEngine Task Queue + Transactions behavior

In Part 1 of this story I made some observations about task execution with respect to data modified in a transaction. I have now formally isolated some test cases that reveal inconsistent data. I have deployed a demo application on App Engine and shared the source code; it is an Eclipse project and can be checked out directly into Eclipse.

Here is a screen shot of the AppEngine Test Jig (opened on the tab for testing Transactions and Queue timing):

[Screenshot: AppEngine Test Jig]

In summary, there are two tests. Test 1 demonstrates the ill behavior, and test 2 demonstrates a workaround. Push the Start button for the appropriate test, and when you think it might be done, push the “Get Test Status” button, which returns the current stats measuring the test. Rather than pushing that button repeatedly to monitor progress, you can first press the “Open Feedback Channel” button to see feedback from the server in real time. (Note: this uses the Channel API. After two hours your channel will be gone; just reload the page.) You can push the “Echo” button to verify that the channel is working; it simply pushes an echo message from the server to your browser.

The actual test is simple. There is a state machine that may be advanced until it reaches its end state. It is implemented as a single integer in an object persisted to the DataStore via JDO. That integer is incremented by one as long as it is less than 30, which is considered the end state. The advancement of the state machine is performed within a transaction, and within that same transaction a task is queued to perform the next advancement. This two-step pattern (modify state, then queue task) makes a simple atomic unit of operation. These atomic units can be chained together to drive the state machine to its final state.
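The atomic unit described above looks roughly like this in JDO. This is a sketch under assumed names (PMF, StateMachine, and the “/advance” URL are illustrative, not the demo project’s actual identifiers); on App Engine, a Queue.add() made while a datastore transaction is active is only delivered if that transaction commits:

```java
import javax.jdo.PersistenceManager;
import javax.jdo.Transaction;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

// Sketch of one atomic unit: advance the state, then queue the next
// advancement, all inside the same JDO transaction.
public class JdoAdvance {
    public void advance(Long id) {
        PersistenceManager pm = PMF.get().getPersistenceManager();
        Transaction tx = pm.currentTransaction();
        try {
            tx.begin();
            StateMachine sm = pm.getObjectById(StateMachine.class, id);
            if (sm.getState() < 30) {
                sm.setState(sm.getState() + 1);
                // The task carries the state value it expects to read back.
                QueueFactory.getDefaultQueue().add(
                    TaskOptions.Builder.withUrl("/advance")
                        .param("expectedState", Integer.toString(sm.getState())));
            }
            tx.commit();
        } finally {
            if (tx.isActive()) {
                tx.rollback();
            }
            pm.close();
        }
    }
}
```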

What are we testing for? The hypothesis is that although we successfully commit our transaction, the data is not yet written, and the queued task can act on old state data. (Bad!)


Perform Test 1 (Will demonstrate the “bad” behavior.)

If you have not already done so, click “Open Feedback Channel”. This will provide real time feedback of state changes.

Press “Start Test 1”.

When it completes, press “Get Test Status”.

You should observe “mismatches” occurring. If not, you just got lucky; simply clear your console and run test 1 again. You will see mismatches occur eventually. Here’s a sample output:

[Screenshot: Test 1 status output showing 6 mismatches]


So what does this “mismatch” mean? How do you interpret this output?

Well, recall the hypothesis we are testing. During the transaction, we incremented the state by one and queued a task to do the next advancement of state. The hypothesis is that it is possible for that queued task to execute *before* the incremented data is actually written to disk, and therefore the queued task reads the incorrect (previous) state value.

We can prove it. Because we know what the next state is *supposed* to be, we simply pass that as a parameter of the queued task. When we execute the task, this “expectedState” parameter is compared to what is read from the data store, and if they don’t match, then we log and send a “mismatch” message so we can tally them. Above, we see 6 mismatches in the status area at the top of the page.
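The check inside the task handler can be sketched as follows. The names are illustrative (the servlet class, the readState() helper standing in for the actual JDO read), but the shape of the comparison is as described above:

```java
import java.util.logging.Logger;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Sketch of the task handler's mismatch detection.
public class AdvanceTaskServlet extends HttpServlet {
    private static final Logger log =
        Logger.getLogger(AdvanceTaskServlet.class.getName());

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) {
        int expected = Integer.parseInt(req.getParameter("expectedState"));
        int actual = readState();  // what the datastore currently returns
        if (actual != expected) {
            // The task saw stale data: log it, tally it, and push a
            // "mismatch" message to the browser over the channel.
            log.warning("mismatch: expected " + expected + " but read " + actual);
        }
        // ... advance the state machine and queue the next task as usual ...
    }

    private int readState() { return 0; /* placeholder for the real read */ }
}
```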

Another way of measuring the same thing is simply counting the number of invocations made to advance the state. This is the “invocations” number in the status area at the top of the page. It should be no surprise that it took 36 invocations to get our state machine to arrive at 30. We know that 6 of the times, it started with the incorrect (previous) state.
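The accounting can be checked with a toy, in-memory simulation (plain Java, not App Engine code): a stale read sees the previous value, so its increment rewrites the same state and that invocation is wasted, meaning reaching state 30 with 6 stale reads takes exactly 36 invocations.

```java
// Toy simulation of the invocation accounting. staleSchedule[i] == true
// means invocation i performs a stale read of the previous state value.
public class InvocationMath {
    public static int invocationsToFinish(int endState, boolean[] staleSchedule) {
        int state = 0;
        int invocations = 0;
        while (state < endState) {
            int read = state;
            if (invocations < staleSchedule.length
                    && staleSchedule[invocations] && state > 0) {
                read = state - 1;  // stale read: the previous value
            }
            invocations++;
            if (read + 1 > state) {
                state = read + 1;  // a fresh read advances; a stale one does not
            }
        }
        return invocations;
    }

    public static void main(String[] args) {
        boolean[] stale = new boolean[40];
        for (int i : new int[] {5, 10, 15, 20, 25, 31}) stale[i] = true;  // 6 stale reads
        System.out.println(invocationsToFinish(30, stale));  // prints 36: 30 advances + 6 wasted
    }
}
```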

The one other number above is the “retries” number. If we are going to count the number of times a queued task is invoked, then we need to be mindful that a task execution *can* fail, and in that case it would be retried. This should not normally occur, but I had to track it to be sure these were not polluting our numbers.
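One way to track retries, sketched below: App Engine stamps each task request with a retry-count header (X-AppEngine-TaskRetryCount, per the Task Queue docs), so the handler can tally retried invocations separately.

```java
import javax.servlet.http.HttpServletRequest;

// Sketch: read App Engine's task retry-count header so retried
// invocations can be counted separately from first attempts.
public class RetryCounter {
    static int retryCount(HttpServletRequest req) {
        String header = req.getHeader("X-AppEngine-TaskRetryCount");
        return (header == null) ? 0 : Integer.parseInt(header);
    }
}
```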


Perform Test 2 (Will demonstrate a workaround)

If you have not already done so, click “Open Feedback Channel”. This will provide real time feedback of state changes.

Press “Start Test 2”. It will take longer as each state advancement will take about two seconds.

When it completes, press “Get Test Status”.

You should have no mismatches, and your output should look something like this:

[Screenshot: Test 2 status output with no mismatches]


Everything is clean. But I consider this a “workaround”. I would rather it simply behave the way you’d expect: a task queued within a transaction should see data consistent with what that transaction wrote.


The Workaround

Test two runs exactly the same code, with two changes. First, the jdoconfig.xml for test 2 uses an 1800 ms timeout on datastore writes. Second, every task queued to advance the state is submitted with a parameter that schedules it to run 2 seconds in the future.

This workaround appears to work well under normal conditions. By simply waiting two seconds, the data gets written, and when the task finally executes, the correct state is available. I would *like* to believe that if the write fails to complete within 1800 ms, my commit would fail and the task would never get queued. However, I fear that if that write fails, my commit is probably long gone and the task was already queued anyway. Such is the nature of a distributed DataStore, I suppose; we can’t really wait around for consistency at commit time.
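The second change is made with the Task Queue API’s countdownMillis option. A sketch, with illustrative URL and parameter names:

```java
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

// Sketch: queue the follow-up task with a 2-second countdown so it
// runs no sooner than 2 s after being enqueued.
public class DelayedEnqueue {
    public static void enqueueNext(int expectedState) {
        QueueFactory.getDefaultQueue().add(
            TaskOptions.Builder.withUrl("/advance")
                .param("expectedState", Integer.toString(expectedState))
                .countdownMillis(2000));
    }
}
```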

Here is the jdoconfig.xml entry for test 2:

<persistence-manager-factory name="transactions-manual-test2">
	<property name="javax.jdo.PersistenceManagerFactoryClass"
		value="org.datanucleus.store.appengine.jdo.DatastoreJDOPersistenceManagerFactory" />
	<property name="javax.jdo.option.ConnectionURL" value="appengine" />
	<property name="javax.jdo.option.NontransactionalRead" value="false" />
	<property name="javax.jdo.option.NontransactionalWrite" value="false" />
	<property name="javax.jdo.option.RetainValues" value="true" />
	<property name="javax.jdo.option.DatastoreWriteTimeoutMillis" value="1800" />
	<property name="datanucleus.appengine.autoCreateDatastoreTxns" value="false" />
</persistence-manager-factory>

The Conclusion?

So far I have just isolated some behavior, but I still lack an understanding of it. I need a Google App Engine person to weigh in on how data consistency is expected to behave after you successfully commit a transaction (using the Master/Slave datastore). I would also like to hear their take on this programming pattern: modify data state, then queue a task to act on that state. I feel this pattern is fundamental and must have a dependable solution available in order to use AppEngine for anything.

I am also beginning to question the role of JDO in all this behavior. My next step would be to add another test case that advances the state without using JDO at all, using just the raw DataStore calls and transactions. That may be a part 3 posting yet to come.