The International Journal 
of Newspaper Technology


November 2002



Applied Semantics
310.446.8162
appliedsemantics.com

Applied Semantics signs deal with USAToday.com
Young company automates metadata extraction, integrates editorial system

By Hays Goodman
Associate Editor



Geography matters.

So says Gilad Elbaz, the chief information officer and co-founder of Applied Semantics Inc., a Los Angeles-based technology company.

The company resisted the mass migration of young upstarts moving lemming-like to Silicon Valley during the mid- and late ’90s. Many of those companies met the same fate as lemmings, it turns out, but not Applied Semantics. Its conference table may have a few chips in it, and the chairs aren’t Aerons, but the company is making money and recently closed a deal with USAToday.com.

“When I came to [Los Angeles] I found a lot more affinity for the idea of focusing on what people want, what people need,” said Elbaz. “It was easier to find people who were interested in an idea and technology rather than if you had (venture capital firm) Sequoia backing you. Besides, my friend was starting Google at the time and everyone went to go work for him.”



Part of the Applied Semantics team at their office in Los Angeles. From left, Gilad Elbaz, co-founder and chief information officer; Chris Daniels, director of sales and business development; and Jordan Libit, chief executive officer.
Photo by Hays Goodman

Chris Daniels, director of sales and business development at Applied Semantics, approached USAToday.com in 2001 with the idea of using Applied Semantics’ Categorizer product, which would help the Gannett Co. Inc. newspaper categorize and summarize editorial content on its Web site.

“I literally cold-called them,” Daniels recalled. “We were really just discovering newspapers and starting to go to the trade shows. I knew that they were on the forefront with experimenting with online architectures and online delivery.”

USA Today had built an in-house, eXtensible Markup Language-based editorial workflow system. Early on, the paper recognized the need for consistent, quality metadata to accompany articles. Complete metadata often allows an editorial system to provide accurate retrieval and a much richer search experience, especially when it comes to archives.

However, when metadata isn’t consistently applied, it leads to problems. One editor may categorize a story one way, and another may have a different idea and assign different keywords. Inconsistently applied metadata, while better than none at all, still has a degree of randomness.

“The one thing we introduced there that I don’t believe they considered in the past was the [International Press and Telecommunications Council] taxonomy,” Daniels said. “I think that was a big draw: Applied Semantics was going with standards that were used in the newspaper industry.”

The companies agreed to test the process, which is based on a proprietary technology known as Conceptual Information Retrieval and Communication Architecture, or CIRCA. The system draws on the fundamental relationships of knowledge to organize textual data, which lets it make use of the meanings contained in the text rather than simply recognizing the words.

A typical example is the word Java, which has a number of meanings, including a synonym for coffee, an Indonesian island and a computer programming language. All three meanings are on the same “level,” which makes recognition and tagging relatively easy in that case, based on the surrounding text.
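To illustrate how surrounding text can pick out one meaning of Java, here is a minimal Python sketch. It is not how CIRCA works internally, which the company has not published. The sense names, the cue-word lists and the function disambiguate are all invented for this example, and the scoring is a simple word-overlap count.

```python
# Hypothetical sense inventory: each sense of "Java" is paired with a few
# context words. These lists are invented for illustration only.
SENSES = {
    "coffee": {"cup", "brew", "espresso", "cafe", "roast"},
    "island": {"indonesia", "jakarta", "volcano", "island", "sea"},
    "programming language": {"code", "compiler", "software", "class", "developer"},
}


def disambiguate(text, senses=SENSES):
    """Return the sense whose cue words overlap most with the text, or None."""
    # Lowercase the words and strip surrounding punctuation before comparing.
    words = {w.strip(".,;:!?\"'()").lower() for w in text.split()}
    scores = {sense: len(words & cues) for sense, cues in senses.items()}
    best = max(scores, key=scores.get)
    # If no cue word appears at all, the text gives no basis for a choice.
    return best if scores[best] > 0 else None


print(disambiguate("The developer compiled the Java class into software."))
print(disambiguate("Jakarta is the largest city on the island of Java."))
```

A production system would draw on far richer context than a handful of cue words, but the principle is the same: the words around an ambiguous term decide its tag.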

In the case of a word like Ford, however, the system has to rank the relationships generated. Ford is a car manufacturer as well as a company. The concept “car manufacturer” is more specific than company, so it would receive a stronger value. This entire scheme of how concepts relate is called an ontology and forms the core of most linguistics engines produced today.
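One simple way to picture the Ford example is to treat the ontology as a tree and score a concept by how deep it sits. The Python sketch below does this. The article does not say how CIRCA actually computes its values, so the depth measure, the tiny hierarchy and the function names are all assumptions made for illustration.

```python
# Hypothetical fragment of a concept hierarchy, stored as child -> parent.
PARENT = {
    "organization": None,
    "company": "organization",
    "car manufacturer": "company",
}


def depth(concept, parent=PARENT):
    """Count the steps from a concept up to the root of the hierarchy."""
    steps = 0
    while parent[concept] is not None:
        concept = parent[concept]
        steps += 1
    return steps


def rank_by_specificity(concepts, parent=PARENT):
    """Order candidate concepts from most to least specific (deepest first)."""
    return sorted(concepts, key=lambda c: depth(c, parent), reverse=True)


print(rank_by_specificity(["company", "car manufacturer"]))
```

Here “car manufacturer” sits one level below “company,” so it ranks first, matching the stronger value described above.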

USAToday.com supplied Applied Semantics with thousands of example documents, and the company achieved correct results on more than 90 percent of them.

“Other categorization products get correct results in the 75-percent range, so we were very pleased with that,” Daniels said.

Once fine-tuning was complete, Applied Semantics shipped USAToday.com pre-configured servers. The Document Type Definition for the XML editorial system had been supplied earlier, so Applied Semantics was able to write the appropriate “hooks” into USAToday.com’s system ahead of time. The purpose of a DTD is to define the legal building blocks of an XML document. It defines the document structure with a list of legal elements.

“They cracked the boxes in early June and were implemented in two to three weeks.”
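For readers unfamiliar with DTDs, the Python sketch below shows what a small one might look like inside a story file. The element names are invented, since USAToday.com’s actual DTD is not public. Python’s standard library parses the DTD but does not validate against it, so the sketch compares element names by hand against a list that mirrors the DTD’s declarations.

```python
import xml.etree.ElementTree as ET

# Hypothetical story document with an internal DTD. The element names and
# content are invented; they are not USAToday.com's actual format.
STORY = """<!DOCTYPE story [
  <!ELEMENT story (headline, category, keyword*, body)>
  <!ELEMENT headline (#PCDATA)>
  <!ELEMENT category (#PCDATA)>
  <!ELEMENT keyword (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
<story>
  <headline>Automaker reports quarterly earnings</headline>
  <category>economy, business and finance</category>
  <keyword>Ford</keyword>
  <keyword>earnings</keyword>
  <body>Story text goes here.</body>
</story>
"""

# The element names that the DTD above declares as legal.
LEGAL_ELEMENTS = {"story", "headline", "category", "keyword", "body"}


def illegal_elements(xml_text, legal=LEGAL_ELEMENTS):
    """Return a sorted list of element names in the document that are not legal."""
    root = ET.fromstring(xml_text)
    return sorted({element.tag for element in root.iter()} - legal)


print(illegal_elements(STORY))
```

Because the DTD fixes where category and keyword elements may appear, a vendor holding it can prepare code that writes metadata into the right places before the servers ever arrive.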

The servers have additional functionality beyond generating a categorization scheme and keyword metadata. They also create summaries, the first of which is very short and can be seen on the front of www.usatoday.com, right below the title of an article. The second one is a summary limited by character length, for delivery on personal digital assistants and other future small-screen delivery media. The third is a longer summary, which is an abstract that is inserted into the archive system.
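The character-limited summary for small screens can be pictured with a short Python sketch. It covers only the final trimming step and does not attempt the summarization itself. The function name limit_summary and the choice to cut at a word boundary are assumptions, not details from Applied Semantics.

```python
def limit_summary(summary, max_chars):
    """Trim a summary to at most max_chars, cutting at a word boundary when possible."""
    if len(summary) <= max_chars:
        return summary
    cut = summary[:max_chars]
    # If the cut falls in the middle of a word, back up to the last space.
    if not summary[max_chars].isspace() and " " in cut:
        cut = cut[: cut.rfind(" ")]
    # Drop any trailing space or punctuation left at the cut point.
    return cut.rstrip(" ,;:")


headline = "Applied Semantics signs a deal with USAToday.com"
print(limit_summary(headline, 20))
```

Cutting at a word boundary keeps the result readable on a small display, at the cost of sometimes using fewer characters than the limit allows.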

“The editors can have Applied Semantics return results via pull down in the USAToday.com XML Editor application,” said Adrian Bouten, vice president of technology and business development at USAToday.com. “The menu item calls a macro that submits the story to Applied Semantics and returns results to the application.”

According to Bouten, the editors have been trained to use the system and it doesn’t significantly change the workflow.

“The system has greatly increased consistency in keyword tagging and categorization,” he said. “So far, the increase in editorial productivity is hard to determine (from the auto-summarization) … (but) the consistency in the metadata that is stored within the story greatly increases USAToday.com’s ability to syndicate and re-distribute the content.”

So far, the taxonomies and applications for the technology have been entirely designed around the English language. So what would happen if a French newspaper phoned up Applied Semantics tomorrow and wanted to integrate its technology?

“We’d ask them to kindly wait for a year or two,” Daniels laughed. “When you think of an ontology, it’s really language-independent. A chair is a chair no matter what language, and it’s related to the floor … but so far we’ve only mapped our ontology to English. A fair amount of it is also mapped to Spanish, but more of the complications have to do with the natural language processing on the front end before you even get to the ontology. It’s on the roadmap for us, though. By the end of ’03, it would be nice to have two or three languages done.”

Applied Semantics would likely pursue the idea with Spanish first, Daniels said, given the number of Spanish-language newspapers in the U.S. What are the company’s other plans for 2003? Applied Semantics can’t yet publicly name the content management companies it would like to pair up with to integrate its technology into well-established editorial and workflow systems.

According to Jordan Libit, chief executive officer, the company will continue to focus on maintaining profitability and growing the staff as necessary. He thinks the increasing penetration of XML-based editorial systems in mid- and large-sized newspapers makes accurate tagging of metadata an expanding market and one that is ripe for Applied Semantics’ product line.