Lizzi Sassman and Martin Splitt brought a special Google guest onto their Google Search Off the Record podcast to discuss structured data. The guest is Ryan Levering, who has been with Google for over 11 years working on structured data.
Structured Data Past At Google
In short, Ryan Levering explained that when he first started working on structured data at Google, he worked on the legacy Data Highlighter tool in Search Console. Early on, Google seemed to want to move away from requiring site owners to highlight or mark up their content and instead use machine learning to figure it all out, something Google’s Gary Illyes said back in 2017 but kind of retracted in 2018. So Google poured a lot of effort into machine learning to extract that data from pages.
Structured Data Present At Google
But over time, Ryan said, it was “much easier to just ask people to give us their data rather than to pull it off of the web pages.” “It’s surprisingly more accurate,” he added. So Google moved more resources into building out structured data support and documentation, so site owners could hand over the data themselves.
But machine learning has not been thrown out the window. Ryan said they still use it a lot (1) for sites that do not use structured data, where Google still wants to show rich results, and (2) for catching mistakes or abuse, so Google can verify what the page really says against what the structured data claims. Ryan described it as having “multiple prongs”: structured data and machine learning work side by side to understand it all.
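To make the second point concrete, here is a minimal sketch of what checking structured data against the visible page could look like. This is not how Google does it; it is only an illustration using Python's standard library. The names `PageParser`, `iter_scalars` and `find_mismatches` are made up for this example, and the check itself (a plain substring match on a couple of fields) is an assumption chosen for simplicity.

```python
import json
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Hypothetical helper: collects JSON-LD blocks and visible text separately."""

    def __init__(self):
        super().__init__()
        self.json_ld = []        # parsed JSON-LD objects
        self.visible_text = []   # text a visitor would actually see
        self._in_json_ld = False
        self._skip_depth = 0     # inside a non-JSON-LD <script> or <style>
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # broken markup is simply ignored in this sketch
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)
        elif not self._skip_depth:
            self.visible_text.append(data)


def iter_scalars(node):
    """Yield (key, value) pairs for every non-container value, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if isinstance(value, (dict, list)):
                yield from iter_scalars(value)
            else:
                yield key, value
    elif isinstance(node, list):
        for child in node:
            yield from iter_scalars(child)


def find_mismatches(html, fields=("name", "price")):
    """Return (field, value) pairs from the JSON-LD that never appear in the visible text."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.visible_text).split()).lower()
    mismatches = []
    for item in parser.json_ld:
        for key, value in iter_scalars(item):
            if key in fields and str(value).lower() not in text:
                mismatches.append((key, value))
    return mismatches


# Made-up page: the markup says 19.99, but the visible price is 24.99.
out_of_sync = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "Product", "name": "Blue Widget", '
    '"offers": {"@type": "Offer", "price": "19.99"}}'
    "</script></head><body><h1>Blue Widget</h1>"
    "<p>Now only $24.99</p></body></html>"
)
print(find_mismatches(out_of_sync))
```

On this made-up page the product name matches but the price does not, so the price is flagged. A real check would need to be far more forgiving about formatting, currencies and so on.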
So that is how Google uses it all today, but what about the future?
Structured Data Future At Google
For the “medium term future,” Ryan said Google plans on using structured data for things that are “not just visual treatments but actually help with more understanding on the page.” Google has mentioned this before: structured data can help Google understand the page, but it is not a ranking factor. I guess Google will be working more on that. Also in the “medium term future,” Ryan said Google wants to figure out “how to use structured data more universally in a lot of our features rather than just like here and there, scattered around.”
Long term, Ryan talked about how structured data will interact with the way Google will “interpret it in general into our internal graph.” Ryan said he “would like to move to where we are adjusting more and more data through structured data-specific channels rather than necessarily conveying all of our information on the web page itself.” Basically, figuring out a “cleaner way to do data transfer between data providers and Google.” How would Google do this? He said maybe by working with the large CMS platforms and plugins so they can build it into their platforms directly.
Here are parts of the transcript:
Ryan Levering: So, my introduction, when I started at Google, we were working on extraction from web pages. So like doing it via ML. So we came in, and the first thing I worked on was the data highlighter product, which is externally. We were looking at web pages and pulling structured data from unstructured text, and my whole team was very into the actual ML aspects of it. So how do we extract data, which in academic circles is often called “wrapper induction”? So when you take the– you build a wrapper that can pull the data out of a template. So reverse engineer the database. But after several years of working on it, there was another project that was side by side that was extracting structured data, which became the core of what we use now.
And I became convinced, after talking to people for a long period of time, that it was much easier to just ask people to give us their data rather than to pull it off of the web pages. It’s surprisingly more accurate. There’s other problems that can happen because of that, but it’s generally an easier thing to do. And it’s a lot less work for us, and it’s a lot better for the provider. So I came to it from ML and seeing structured data as the enemy at first. And then I was won over as a good mechanism.
So machine learning is– I see as like multiple prongs in our approach for how we get stuff. We want to use machine learning for cases where either we don’t have more information where it’s not provided for us. But it’s always going to be easier to just have the data shown to us, I think. So we will try– I think it’s like a multi-tiered approach, where you have machine learning for cases where we don’t have that data specifically. But then providers always have the option of giving us data, which usually improves accuracy, which usually gives better benefit for the actual provider. So I always see them as working side by side in an ideal world.
Most of our features over time migrate to that approach where we ingest it. Maybe we start with one approach where we’re just using ML. And then we eventually add markups so people have control. Or it’s the opposite way around. And we start– we bootstrap with markup in an eco-system approach where people are giving us data. And then we enhance coverage of the feature by adding ML long run. So, I see them as very compatible. But it’s always good to empower people who are giving you data, to have control over that. So I think it’s really important that structured data in general is part of the overall strategy so the people can actually have some control over the content that we show.
The primary challenge is that we then have to figure out a way to verify that the structured data is accurate. And sometimes this is from actual abuse. And sometimes this is just because there’s a problem with synchronicity. Sometimes people generate structured data for their websites and it becomes out of sync with the actual stuff that’s being shown visually. We see a lot of both. So there needs to be other mechanisms to figure out some balancing act where those things are enforced. So that’s the cost of structured data, I guess, is that extra checking.
Lizzi Sassman: Yeah, speaking of the work that has been done, what about the work that’s to come, the next couple of years for structured data? If you were to give us a peek into the future, what is next for structured data?
Ryan Levering: In the medium-term, I think we’re… I mean we continue to flesh out the structured data usage in terms of adding more features and looking into more ways we can use it in cooler things that are not just visual treatments but actually help with more understanding on the page, I think. And figuring out how to use structured data more universally in a lot of our features rather than just like here and there, scattered around. I think that’s what we’re looking at in a medium-term.
Long-term, I think that it’s going to play a really interesting role at interacting with the way that we interpret it in general into our internal graph. So I would like to see more machine learning, figuring out– I would like to move to where we are adjusting more and more data through structured data-specific channels rather than necessarily conveying all of our information on the web page itself. So I think that’s a much cleaner approach, particularly for some of our structured data ingestion paths. So figuring out a way to get around the actual visual representation and figuring out ways to link the structured data with the web page but not necessarily embed it on the web page. So I think there’s a cleaner way to do data transfer between data providers and Google.
I think that it will make it easier for plug-ins and CMSs to create that information particularly. Because I feel like a lot of the eco-system has moved in that direction where people aren’t implementing the structured data themselves but rather are using content creation tools. I think it’s becoming more important that we have mechanisms to work directly with those content creation tools to ingest the data in a programmatic way in order to make it fresher and easier.
Forum discussion at Twitter.