March 29, 2007

Where does Tealeaf Data Come From -- and Why

As web applications have grown in usage, feature richness, and complexity, the need to understand the end user experience has skyrocketed. This has led to many different ways of trying to gather insight into that experience.

In the beginning, the web was a place to consume static content. Web servers kept access logs that showed which items were requested. Since the content was static, a successful status code (e.g. HTTP 200) was adequate for knowing that information was delivered. And if you knew which resource was being requested, you could also infer the content.

The Web didn't stay static for very long.

However, the methods of logging the web requests haven't changed much in well over a decade.

Web logs

  • The original standard. Contains the request for the resource along with the associated status code.
  • Requires extra work and/or extensions to get 'Post Data'. Understanding 'Post Data' is vital for transactional web sites.
  • Requires one data collector per Web server. In a large farm of web servers, this gets to be problematic.
  • Yields a server side view of the world. A page may have been requested by the user but never received, or canceled mid-flight.

The main blind spot of Web logs is that a Web log has no knowledge of what content was actually delivered to the user! If the dynamic web page contains pricing and product availability -- the web log has no record of this. If you request the page later, pricing and availability may have changed, since they are conditions at a point in time, not fixed lookups (i.e. availability is a temporal attribute -- two days later, the concert ticket for Springsteen is no longer available). You would not know this from looking at the Web log.
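To make the blind spot concrete, here is a minimal sketch that parses one entry in the Common Log Format. The sample line, the host, and the URL are made up for illustration. Notice what the parsed entry holds: the request, the status code, and a byte count, but nothing about the price or availability that was on the page.

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf_line(line: str) -> dict:
    """Return the fields that a Common Log Format entry exposes."""
    match = CLF_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a Common Log Format line: {line!r}")
    return match.groupdict()


# Hypothetical log entry for a dynamic ticket page.
sample = (
    '203.0.113.7 - - [29/Mar/2007:10:15:32 -0700] '
    '"GET /tickets?event=1234 HTTP/1.1" 200 5120'
)
entry = parse_clf_line(sample)

# The log knows the resource and the status code, and that is all.
print(entry["path"], entry["status"])
print(sorted(entry))  # no field holds the delivered page content
```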

PageTags

As the web grew more sophisticated, there was no longer a guarantee of having a direct path between the user's browser and the web server(s). The advent of caches and application acceleration services (Akamai et al.) meant that some or all content could be served by an intermediary. In practice, this was mostly for static content.

Since 'caches' could intercept the request, the web server would not get accessed for the content, and thus there was no log entry.

In addition, collecting web logs from a whole farm of servers was a tedious and sometimes error-prone process.

For these and other reasons, 'PageTagging' was born. While page tagging can take many forms, the most common is a snippet of JavaScript that runs automatically every time the page is loaded.

The JavaScript makes a dummy request back to the origin site (i.e. the web server) or to a third-party AASP (Analytics Application Service Provider). In this request, the JavaScript can supply information to be analyzed. It can pick up environmental data from the browser (true page load times, screen resolution, locale) along with other information that the programmer was instructed to include as part of the page (shopping aisle, contents of the cart, true name of the page, etc.).
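The dummy request is typically just a URL with the collected values packed into the query string. The sketch below, written in Python rather than JavaScript, shows both halves under assumed names: how such a beacon URL could be assembled, and how a collector could unpack it. The collector address and the field names (`page`, `aisle`, `load_ms`, `screen`) are hypothetical, not any vendor's actual tag format.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_beacon_url(collector: str, fields: dict) -> str:
    """Pack tag values into the query string of a dummy request."""
    return f"{collector}?{urlencode(fields)}"


def parse_beacon(url: str) -> dict:
    """Collector side: recover the tag values from the beacon URL."""
    query = urlsplit(url).query
    return {name: values[0] for name, values in parse_qs(query).items()}


# Hypothetical collector endpoint and tag fields.
beacon = build_beacon_url(
    "https://collector.example.com/tag.gif",
    {
        "page": "Checkout Step 2",    # true name of the page
        "aisle": "concert-tickets",   # semantic data supplied by the programmer
        "load_ms": 840,               # environmental data from the browser
        "screen": "1280x1024",
    },
)
fields = parse_beacon(beacon)
print(beacon)
print(fields)
```

Only the values that were put into the tag come out the other end, which is the limitation discussed in the list of problems that follows.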

Since the JavaScript runs on page load, most of the issues related to caching disappear.

With the movement towards Rich Internet Applications (Ajax/Flex), the notion of having a 'Page' for every state transition will go away. The application will then need to trigger the JavaScript for each major state change.

The next step in the page tagging evolution was to create JavaScript that actually monitors the user interaction within a page. This class of JavaScript monitors DOM (Document Object Model) events -- i.e. it can see every keystroke and every user interface event (mouse move, etc.).

This allows a company to see what's happening within a page -- field to field navigation, dwell time per form field, last form field entered prior to abandoning the application.

The problems of page tagging include:

  • The cost of putting in the tags. It takes programming, QA, staging and more.
  • The more complex tags can break the application -- since they are part of the application.
  • The job is never done. The website changes constantly -- tags need to be added to new pages, changed, etc. At some companies, the lag between requesting a new or changed page tag and actual implementation can be lengthy. This requires coordination between IT and marketing.
  • There is additional overhead on the client. Pages are now bigger to download because you have to download the JavaScript. While this can be ameliorated through caching, it is still an issue.
  • The JavaScript needs to talk back to the server -- consuming additional bandwidth from the client and creating a multitude of new requests back to the server(s). If the client doesn't send the data back -- it is lost forever.
  • There are potential security and privacy issues about collecting data from the client. Data must be encrypted in transit, and sensitive data such as passwords must never be collected.
  • The JavaScript can take additional CPU resources on the client.
  • Most important -- the value of a tag represents the 'pre-ordained idea' of the developer who implemented the tag. The tag is one slice of the overall picture. If the analysis of the data requires the actual prices presented to the user and the tag doesn't contain them -- the opportunity for analysis is forever lost.

Application Instrumentation

In the page-tagging model, the Web page is the carrier or transport of the analytic payload. Traditionally, a good software development practice is to build additional logging capability into the application. This logging capability can yield analytic insight or extra diagnostic information.

Why should server side components need to put extra information into the Web payload? The ability to join extra information 'Out of Band' from the server side along with the user experience can be very helpful. This 'join' can take place in real time in the analytics framework -- or it can occur post-mortem through effective pivoting or slicing/dicing of the data.

For example, if the application uses log4j as a common logging framework, it can be very beneficial to join this data with the end user experience to diagnose root cause or to expose additional analytic information for subsequent analysis. You don't want to include this information in the page body.
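As a rough illustration of such an out-of-band join, the sketch below attaches application log events to the hit they occurred during. It assumes both sides carry a shared session identifier and comparable timestamps. The record layout (`session_id`, `start`, `end`, `time`) is hypothetical and is not a log4j or Tealeaf format.

```python
from collections import defaultdict


def join_logs_to_hits(hits: list, log_events: list) -> list:
    """Attach each log event to the hit of the same session it falls within."""
    events_by_session = defaultdict(list)
    for event in log_events:
        events_by_session[event["session_id"]].append(event)

    joined = []
    for hit in hits:
        related = [
            event
            for event in events_by_session.get(hit["session_id"], [])
            if hit["start"] <= event["time"] <= hit["end"]
        ]
        joined.append({**hit, "log_events": related})
    return joined


# Hypothetical data: one captured hit and three server-side log events.
hits = [{"session_id": "s1", "url": "/checkout", "start": 100, "end": 200}]
log_events = [
    {"session_id": "s1", "time": 150, "level": "ERROR",
     "message": "payment gateway timeout"},
    {"session_id": "s2", "time": 150, "level": "INFO", "message": "login ok"},
    {"session_id": "s1", "time": 500, "level": "INFO", "message": "logout"},
]

joined = join_logs_to_hits(hits, log_events)
print(joined[0]["log_events"])
```

The error never appears in the page body, yet after the join it sits next to the hit the user actually experienced.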

What does Tealeaf do?

Tealeaf, by default, passively captures the complete HTTP/HTTPS stream of each and every user on a website. Tealeaf receives its data by plugging into the SPAN port of a switch. Because the capture is passive, Tealeaf adds no latency, takes no CPU on the web servers, and adds no risk to the application.

You can think of Tealeaf as a reverse application server, reassembling TCP packets back into the HTTP request and response. The process of gathering the hit (i.e. the HTTP request and its associated response) generates interesting metadata that can be turned into value-added metrics. This metadata includes:

  • Timestamp for first byte of request
  • Timestamp for last byte of request
  • Timestamp for first byte of response
  • Timestamp for last byte of response
  • Timestamp for last acknowledgement from the browser. This can be used to measure network latency with some degree of success.
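One way to turn these five timestamps into metrics is to take the differences between neighbors. The sketch below does that with timestamps in whole milliseconds. The metric names and the way the intervals are carved up are my own illustrative choices here, not a description of how the product defines them, and the latency figure is only an estimate.

```python
from dataclasses import dataclass


@dataclass
class HitTimestamps:
    """The five capture timestamps for one hit, in milliseconds."""
    request_first_byte: int
    request_last_byte: int
    response_first_byte: int
    response_last_byte: int
    last_ack: int


def derive_metrics(ts: HitTimestamps) -> dict:
    """Derive interval metrics (names are illustrative assumptions)."""
    return {
        # Time for the browser to send the whole request.
        "request_ms": ts.request_last_byte - ts.request_first_byte,
        # Time the server spent before it started answering.
        "server_ms": ts.response_first_byte - ts.request_last_byte,
        # Time to put the whole response on the wire.
        "response_ms": ts.response_last_byte - ts.response_first_byte,
        # Gap until the browser acknowledged the last packet.
        "network_estimate_ms": ts.last_ack - ts.response_last_byte,
        "total_ms": ts.last_ack - ts.request_first_byte,
    }


# Hypothetical hit: timestamps relative to the first byte of the request.
metrics = derive_metrics(HitTimestamps(0, 20, 320, 400, 460))
print(metrics)
```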

In addition, data elements are parsed and decoded. The URL is broken down into name-value pairs. Post data is decoded into name-value pairs as well.
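For a form submission, that decoding can be sketched with the Python standard library. This assumes the post data is ordinary URL-encoded form data. The URL and field names are invented for the example.

```python
from urllib.parse import parse_qsl, urlsplit


def decode_hit(url: str, post_body: str = "") -> dict:
    """Break the URL query and the form-encoded body into name-value pairs."""
    return {
        "url_fields": parse_qsl(urlsplit(url).query, keep_blank_values=True),
        "post_fields": parse_qsl(post_body, keep_blank_values=True),
    }


# Hypothetical add-to-cart request.
decoded = decode_hit(
    "https://shop.example.com/cart/add?sku=A100&qty=2",
    "promo=SPRING&gift_wrap=",
)
print(decoded["url_fields"])
print(decoded["post_fields"])
```

Keeping blank values matters for analysis: a form field the user left empty is itself an observation.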

After the hit is assembled, it is run through a pipeline of functions. These perform operations such as privacy (removing or encrypting sensitive data), sessionization (grouping related hits per user) and routing, until the data goes into an in-memory database. As the data enters the in-memory database, rules are run against the hit and the session. These rules can scrape values out of the request or the response for subsequent scoring and thresholding. The rules can also look across the 'session object' and make determinations about the online experience using more complex logic.
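The shape of such a pipeline can be sketched in a few functions: a privacy step, a sessionization step, and one rule that scores each session. Everything specific here is an assumption made for illustration: the list of sensitive field names, masking rather than encrypting, grouping by a `session_id` key, and a rule that simply counts server errors. It is not the product's actual pipeline.

```python
from collections import defaultdict

# Hypothetical list of fields that must never be stored in the clear.
SENSITIVE_FIELDS = {"password", "card_number"}


def apply_privacy(hit: dict) -> dict:
    """Privacy step: mask sensitive values before the hit is stored."""
    masked = {
        name: ("****" if name in SENSITIVE_FIELDS else value)
        for name, value in hit["fields"].items()
    }
    return {**hit, "fields": masked}


def sessionize(hits) -> dict:
    """Sessionization step: group related hits per user session."""
    sessions = defaultdict(list)
    for hit in hits:
        sessions[hit["session_id"]].append(hit)
    return dict(sessions)


def error_rule(session_hits: list) -> int:
    """Example rule: score a session by its number of server errors."""
    return sum(1 for hit in session_hits if hit["status"] >= 500)


def run_pipeline(raw_hits: list) -> dict:
    sessions = sessionize(apply_privacy(hit) for hit in raw_hits)
    return {
        session_id: {"hits": hits, "error_score": error_rule(hits)}
        for session_id, hits in sessions.items()
    }


raw_hits = [
    {"session_id": "s1", "status": 200, "fields": {"user": "ann", "password": "pw"}},
    {"session_id": "s1", "status": 500, "fields": {"card_number": "4111"}},
    {"session_id": "s2", "status": 200, "fields": {"aisle": "tickets"}},
]
result = run_pipeline(raw_hits)
print(result["s1"]["error_score"], result["s2"]["error_score"])
```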

This scoring of user activity yields interesting metrics which are published in dashboards. In addition, these observations can be pumped back to SaaS (ASP) web analytics products as synthetic page tags.

Creating these rules requires work; however, it doesn't require changing the application. These rules run in the context of the Tealeaf CX data store.

If the application is already using page-tags -- these tags can also be used to yield additional information. For example, if the tags contained semantic information such as 'aisle' or 'category' or 'class of user', this can be easily harvested and reused within Tealeaf.

When a session is over (usually determined by a timeout), the session is persisted. The essence of the session is made searchable by dynamically converting it into an XML document and streaming that document through an embedded XML search engine.
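To give a feel for the idea, the sketch below turns a finished session into an XML document and then runs a simple field lookup over it. The element and attribute names are invented, and a real search engine would build an index rather than walk the tree, so treat this as a toy under those assumptions.

```python
import xml.etree.ElementTree as ET


def session_to_xml(session_id: str, hits: list) -> ET.Element:
    """Convert a session's hits into one XML document (made-up schema)."""
    root = ET.Element("session", id=session_id)
    for hit in hits:
        hit_el = ET.SubElement(root, "hit", url=hit["url"], status=str(hit["status"]))
        for name, value in hit["fields"].items():
            field_el = ET.SubElement(hit_el, "field", name=name)
            field_el.text = value
    return root


def find_hits_with_field(root: ET.Element, name: str, value: str) -> list:
    """Return the URLs of hits that carry the given name-value pair."""
    return [
        hit.get("url")
        for hit in root.iter("hit")
        if any(f.get("name") == name and f.text == value for f in hit.iter("field"))
    ]


# Hypothetical two-hit session.
session_xml = session_to_xml("s1", [
    {"url": "/search", "status": 200, "fields": {"q": "springsteen"}},
    {"url": "/cart/add", "status": 200, "fields": {"sku": "A100", "qty": "2"}},
])
matches = find_hits_with_field(session_xml, "sku", "A100")
print(ET.tostring(session_xml, encoding="unicode"))
print(matches)
```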

In many of the recent articles about Business Intelligence (BI), the future is all about using search as one of the most powerful tools for discovery. Tealeaf has worked this way since its inception. Hindsight may be 20/20 -- however in this case, our foresight was pretty good too.

The secret to Tealeaf's success is that we have optimized the TCO of acquiring and managing this amount of data. In addition, we're hard at work at further exploiting the value of this data.

Tealeaf Multi-Point Capture

Tealeaf always depends upon network capture of the complete HTTP/HTTPS stream.

We can marry the HTTP/S stream with application instrumentation. In addition, we can take insight from within the user's browser. This allows us to capture and replay Ajax, Flash and Flex, in addition to getting more insight from traditional HTML applications. Along with this, we can read other 'tags', create awareness with our rules and even do operations like shadow browsing of active sessions.

The ability to have all of these different types of information in one flexible container is a unique value proposition of Tealeaf.

How can you trust aggregate data?

Many times, the aggregate data leads to the observation of 'What Happened.' How many visitors to my site, how many went down this path, how many orders did I get?

This leaves two fundamental gaps.

  • Why? Why did they abandon at step 'X' and what did the site do to force this to happen? Was it price, page content, performance, confusing navigation, errors on the site or more?
  • How can you believe aggregate data if you can't drill down to the underlying data? Summarizing data has its merits -- but not at the expense of fundamental understanding.

What is the right analytic approach?

While each approach has its own relative strengths and weaknesses, I believe that only Tealeaf has the complete solution: the richest data set, the richest set of functions on the data and the biggest opportunity for exploitation and true ROI.

-Robert Wenig
