Sampling Done Right: Avoiding Common Event Data Mistakes

By Pete Kurkowski

Sampling matters for any brand that wants to explore its vast amounts of data effectively. Whether you're trying to understand product usage or gauge engagement among your target audience, sample data is key. And when it comes to behavioral analytics and customer journey analytics, finding the right question is often even harder than finding the right answer.

 

Sampling your data with a continuous intelligence platform speeds up exploration by cutting query response time to a few seconds, even across billions of events. Brands can explore questions faster, avoid dead-end answers, and dive deeper into data. A typical exploratory session starts with a vague idea about a gem that might be lurking in the data, turns into a series of queries that zero in on the right question, and ends with the final query saved and pinned to a dashboard.

 

Other solutions require sampling just to be able to work at scale. With continuous intelligence and real-time analytics, sampling is a useful tool but not essential for fast results at massive scale. You can obtain statistically accurate sampled results in real time, refine your queries, and, when ready, get fast unsampled results for hundreds of billions of events. Continuous intelligence platforms like Scuba can ingest and store all raw events and optionally sample at query time.

 

Let’s review some key concepts:

 

  • Sampling is the process of selecting a subset of some general population that can be used to accurately estimate important attributes of the population as a whole. That subset is called the sample.
  • Behavioral analytics focuses on how various actors behave when interacting with a product or service. Those actors could be people, devices, sensors, etc. The behavior is tracked as a sequence of events that occur at specific times. The order, duration, and time between events are all relevant for understanding behavior.
  • Event data is data from any occurrence that has significance for a product or service. Events describe an action associated with an actor at a specific time.
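
To make these concepts concrete, here is a minimal Python sketch of an event record and of how events might be grouped into per-actor sequences. The `Event` fields and the helper names (`sessions_by_actor`, `gaps`) are illustrative assumptions, not part of any particular platform's schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    actor_id: str     # who or what acted (person, device, sensor)
    action: str       # what happened
    timestamp: float  # when it happened, in seconds (hypothetical unit)


def sessions_by_actor(events):
    """Group events per actor and order each group by time."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event.actor_id].append(event)
    for actor_events in grouped.values():
        actor_events.sort(key=lambda e: e.timestamp)
    return dict(grouped)


def gaps(actor_events):
    """Seconds between consecutive events of one actor."""
    return [b.timestamp - a.timestamp
            for a, b in zip(actor_events, actor_events[1:])]


# Events usually arrive interleaved and not strictly in order.
events = [
    Event("u1", "open_app", 0.0),
    Event("u2", "open_app", 1.0),
    Event("u1", "purchase", 12.0),
    Event("u1", "search", 5.0),
]
by_actor = sessions_by_actor(events)
u1_actions = [e.action for e in by_actor["u1"]]
u1_gaps = gaps(by_actor["u1"])
```

Order and time between events only become visible once events are grouped by actor, which is why behavioral questions depend on having each actor's full sequence.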

Avoiding common event data mistakes

1. Keep context in mind.

Many new applications are connected applications with millions of users and thousands of events per user session. Many brands see hundreds of millions, or even billions, of events per hour. Ingesting, storing, and analyzing all that data in terms of behavior takes a dedicated approach. General-purpose solutions for big data analytics might keep up at smaller scales, but at high volumes they're forced to make compromises like using much more expensive clusters, taking longer to get their answers, and depending on busy data scientists to translate basic questions into code.

2. The attraction of behavioral analytics is to discover something new about users and interactions.

The discovery process is exploratory by nature, and exploration is best done interactively. There’s something compelling about diving deep into the data and seeing it in new ways and from different angles. Time to discovery is a critical measure of an analytics solution. The good news is that there’s usually a balance between how accurate an answer needs to be, how much it costs to get a more accurate answer, and how much value additional accuracy brings to the organization.

3. Know when to sample, and when not to.

Believe it or not, sampling isn’t always appropriate. Some data isn’t evenly distributed across the shards (the partitions the data is split into), so a sample drawn from a few shards can be skewed. Some events are very rare and unlikely to show up in a sampled result. Sometimes you’re looking for a tiny set of events but aren’t sure when they occurred. Sometimes the selection filters leave too few events to sample accurately.
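
The rare-event problem can be put into rough numbers. The sketch below assumes each occurrence of an event is kept independently with probability equal to the sample rate, which is a simplification; the function name `prob_missed` is hypothetical.

```python
def prob_missed(occurrences: int, sample_rate: float) -> float:
    """Chance that none of the occurrences lands in the sample,
    assuming each one is kept independently at `sample_rate`."""
    return (1.0 - sample_rate) ** occurrences


# An event that happened 5 times, queried with a 1% sample:
# the chance of seeing none of them is 0.99 ** 5, roughly 0.95.
miss_rare = prob_missed(5, 0.01)

# The same sample rate applied to an event that happened 1,000 times
# almost never misses it entirely.
miss_common = prob_missed(1000, 0.01)
```

Under this assumption, a 1% sample misses a five-occurrence event about 95% of the time, which is when an unsampled query is the safer choice.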

 

For behavioral analytics of event data, there are right and wrong ways to sample. It’s tempting to sample at data collection points. There are potential upsides: the data shrinks and gets easier to ingest, less data needs to be stored, and it can be processed as-is without further reduction. But for behavioral analytics, this approach is tricky and limited.

 

  • First, the sampled events must represent a series of actions by a set of actors. Their contents, sequence, and timing are all important. You can’t just take every 100th event.
  • Second, we don’t know which actors will be interesting ahead of time. That’s part of the discovery process. Plus, the interesting criteria may change from query to query and aren’t known in advance.
  • Lastly, sampling might reduce the collected data to such a degree that it can’t be used for important workflows like A/B testing. The fraction of sampled users shown the modified product might become too small to draw statistically meaningful conclusions.
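
The first point can be illustrated with a short Python sketch that compares taking every 100th event with sampling whole actors. Hashing the actor ID is one common way to sample actors deterministically; it is shown here as an assumption, not as how any specific platform works, and the event fields and function names are hypothetical.

```python
import hashlib
from collections import Counter


def in_sample(actor_id: str, rate: float) -> bool:
    """Deterministically decide whether an actor belongs to the sample."""
    digest = hashlib.sha256(actor_id.encode("utf-8")).digest()
    # Compare the first 64 bits of the hash against the rate threshold.
    return int.from_bytes(digest[:8], "big") < rate * 2**64


def sample_by_actor(events, rate):
    """Keep every event of sampled actors; drop all events of the rest."""
    return [e for e in events if in_sample(e["actor_id"], rate)]


def every_nth(events, n):
    """Naive event-level sampling: keep every n-th event."""
    return events[::n]


# 1,000 hypothetical actors, each with the same three-step journey.
ACTIONS = ["open_app", "search", "purchase"]
events = [
    {"actor_id": f"user-{i}", "action": action, "ts": i * 10 + step}
    for i in range(1000)
    for step, action in enumerate(ACTIONS)
]

# Count how many events each actor has left under each strategy.
by_actor = Counter(e["actor_id"] for e in sample_by_actor(events, 0.1))
by_nth = Counter(e["actor_id"] for e in every_nth(events, 100))
```

With actor-based sampling, every sampled actor keeps all three events, so sequences and timing stay intact. With every 100th event, no actor in this data set keeps more than one event, so no journey can be reconstructed.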

 

For behavioral analytics of event data, the correct approach is to record all the events and make them part of the dataset. Sampling needs to be based on all the events for a representative set of actors from the population. It needs to happen at the time of the query, not during ingest. This approach moves the burden of correct sampling from the end user onto the analytics platform. If the answer is so clear-cut, why isn’t everybody doing it the same way?

 

The answer is implementation: a solution focused on event data can organize and manage data in ways that don’t make sense for a general-purpose analytics solution. That organization brings the power to store and query huge volumes of event data efficiently.

Seamlessly sample and analyze data in real time with Scuba

As a real-time continuous intelligence platform, Scuba Analytics is a purpose-built solution for behavioral analytics of event data at massive scale. With Scuba's no-code querying and real-time analytics, teams across any company can sample their data correctly and draw essential insights from it.

 

