I was reading an interesting article this morning on a new Washington Post prototype tool to automate fact-checking.
The Truth Teller prototype, described as a "Shazam for Truth," sounds very compelling, though taking it from prototype to a scalable solution will not be a trivial task.
But there was one paragraph in the article which jumped out at me as a deep misunderstanding of how semantic technologies work:
You can see a Truth Teller project working well with hard, numbers-driven realities since we already have companies like Narrative Science using algorithms to write sports, real estate and financial news. More difficult though is to take that algorithm and place it against soft, interpretative data. For example — and keeping things current — how sequestration will affect governmental agency X, Y or Z, if at all.
I agree with the basic premise of the paragraph - that as you move from simple, provable facts to interpretative conjecture, it becomes much harder. But it's the first part of the paragraph that I quibble with here. The ability of products like Narrative Science (or the similar Automated Insights) to convert structured data into text is completely different from the reverse process - using technology to read unstructured text and turn it into structured data.
Converting structured data to text is a fairly straightforward process. That's not to say that doing it well is easy - it's not - and companies like Narrative Science and Automated Insights are doing a very impressive job of authoring realistic text. But it's safe to assume that the error rate for that process is near zero. If I provide you with a baseball box score showing that Mike Trout went 3-for-5 with a home run and two singles, 2 RBI and 2 runs scored, you can use these technologies to state that in many different conversational ways. In no case will the information be incorrect - it's just a question of tone and writing style.
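To make that concrete, here is a minimal sketch of the structured-data-to-text direction, using the box score line described above. It is not how Narrative Science or Automated Insights actually work; the `BattingLine` type, the `describe` function and the two style names are all invented for illustration. The point is only that every number in the output is copied straight from the input, so the style can change but the facts cannot.

```python
from dataclasses import dataclass


@dataclass
class BattingLine:
    """One player's line from a box score (hypothetical structure)."""
    player: str
    at_bats: int
    hits: int
    home_runs: int
    singles: int
    rbi: int
    runs: int


def describe(line: BattingLine, style: str = "plain") -> str:
    """Render the same facts in different tones using fixed templates."""
    if style == "plain":
        return (
            f"{line.player} went {line.hits}-for-{line.at_bats} with "
            f"{line.home_runs} home run(s) and {line.singles} single(s), "
            f"driving in {line.rbi} and scoring {line.runs}."
        )
    if style == "casual":
        return (
            f"Big night for {line.player}: {line.hits}-for-{line.at_bats}, "
            f"{line.home_runs} homer(s), {line.rbi} RBI and "
            f"{line.runs} run(s) scored."
        )
    raise ValueError(f"unknown style: {style}")


# The example line from the paragraph above.
trout = BattingLine("Mike Trout", at_bats=5, hits=3, home_runs=1,
                    singles=2, rbi=2, runs=2)

print(describe(trout, "plain"))
print(describe(trout, "casual"))
```

Both sentences read differently, but neither can report a number that was not in the box score to begin with.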
Now, take the reverse process - reading a text-based summary of the game and trying to compile a box score. There are many challenges there. Not every at-bat gets mentioned in the summary. The nuances of language mean that not every mention will be understood by the tagging engine. A home run might be called a homer, four-bagger, cleared-the-bases, round tripper, goner, went yard, moon shot, dinger or even a tater. Of course, semantic tagging uses many methods to understand text, including the use of vocabularies to capture examples like these. But any semantic text tool will miss or misconstrue some of them.
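Here is a rough sketch of the reverse direction, using nothing more than a vocabulary lookup. Real semantic tagging engines combine many more methods than this, so treat it as a toy: the term list, the `find_home_run_mentions` function and the sample sentences are all made up for illustration. Even so, it shows both failure modes, a miss and a misreading.

```python
import re

# A small, hand-built vocabulary of home run synonyms (illustrative only).
HOME_RUN_TERMS = [
    "home run", "homer", "homered", "four-bagger", "round tripper",
    "went yard", "moon shot", "dinger", "tater",
]

# Longest terms first so multi-word phrases are tried before shorter ones.
_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(t) for t in sorted(HOME_RUN_TERMS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def find_home_run_mentions(text: str) -> list[str]:
    """Return every vocabulary term found in the text, lowercased."""
    return [m.group(0).lower() for m in _PATTERN.finditer(text)]


# Caught: both phrases are in the vocabulary.
caught = find_home_run_mentions(
    "Trout hit a dinger in the third and later went yard again."
)

# Missed: "took the first pitch deep" is a home run to a human reader,
# but nothing in the vocabulary matches it.
missed = find_home_run_mentions("Trout took the first pitch deep to left.")

# Misconstrued: a person's name matches a vocabulary term.
misread = find_home_run_mentions("A fan named Homer caught the ball.")

print(caught, missed, misread)
```

Adding more terms shrinks the first problem but never removes it, and every term added is another chance for the second.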
Semantic technologies have come a long way in the past decade. We're able to do things we were only able to dream about a few years ago. But the complexity in scaling any of these prototypes to fully useful applications should never be underestimated.



