ETech: Google lets us look into its search and translation technology
Yesterday, Google's director of research Peter Norvig let visitors at the Emerging Technology conference in San Diego look into the technology that his firm uses in search and translation functions. As Norvig put it, a lot of the time Google doesn't rely on complex models and theories, but simply on large amounts of data.
One example is Google's translation function, which turns Chinese texts into English. In Chinese, multiple symbols that mean something on their own can be combined to create a single word. Google segments Chinese texts by comparing a large number of Chinese and English versions of the same content to increase the probability that the Chinese characters will match the English words.
This language segmentation also plays a role in the spell checker in Google Documents. To demonstrate how hard it is even for people to segment words properly, Norvig displayed a few potentially misleading domain names: the holders of Whorepresents.com, Therapistfinder.com and Penisland.net (a website that sells pens) might not have realised that their Web addresses could be misinterpreted.
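The ambiguity in those domain names can be reproduced with a few lines of Python. The following is only a minimal sketch of data-driven segmentation, not a description of Google's system: it lists every way to split a string into known words and prefers the split whose words are most frequent. The vocabulary and the frequencies in `WORD_FREQ` are invented for illustration, and the function names are hypothetical.

```python
from functools import lru_cache
from math import prod

# Toy vocabulary with INVENTED relative frequencies (illustration only).
# A real system would estimate these from a very large text corpus.
WORD_FREQ = {
    "who": 4e-3,
    "represents": 2e-4,
    "whore": 1e-5,
    "presents": 1e-4,
    "therapist": 5e-5,
    "finder": 3e-5,
    "the": 5e-2,
    "rapist": 5e-6,
}


@lru_cache(maxsize=None)
def segmentations(text):
    """Return every way to split text into vocabulary words, as tuples."""
    if not text:
        return ((),)  # exactly one way to segment the empty string
    results = []
    for i in range(1, len(text) + 1):
        head, tail = text[:i], text[i:]
        if head in WORD_FREQ:
            results.extend((head,) + rest for rest in segmentations(tail))
    return tuple(results)


def score(words):
    """Score a split as the product of its word frequencies (unigram model)."""
    return prod(WORD_FREQ[w] for w in words)


def best_segmentation(text):
    """Return the highest-scoring split, or None if no split exists."""
    candidates = segmentations(text)
    return max(candidates, key=score) if candidates else None


if __name__ == "__main__":
    for domain in ("whorepresents", "therapistfinder"):
        print(domain, segmentations(domain), best_segmentation(domain))
```

With these made-up numbers, both readings of each name are found, and the intended one ("who represents", "therapist finder") scores higher. The outcome depends entirely on the frequencies, which is why the amount of data behind them matters.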
Google uses similar comparative approaches to improve its image search. Up to now, the search has relied only on text-based metadata, which has a relatively high error rate. Researchers at Google are now working to make these searches more precise through image analysis: 1000 images found in the metadata search are then compared to find similarities in order to produce the most relevant image. Norvig says that Google has also already started developing similar approaches for video data. "That will be harder though because of the cost of mass storage", Norvig explained.
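The report does not say how the images are compared. As one possible reading, the sketch below assumes that each candidate image has already been reduced to a numeric feature vector and ranks the candidates by how similar each one is to all the others, so that images resembling most of the result set rise to the top. The cosine measure, the feature vectors and the function names are all assumptions made for illustration.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_mutual_similarity(features):
    """Rank image names by their total similarity to all other candidates.

    features maps an image name to its (hypothetical) feature vector.
    Returns the names, most "typical" image first.
    """
    totals = {
        name: sum(
            cosine(vec, other_vec)
            for other_name, other_vec in features.items()
            if other_name != name
        )
        for name, vec in features.items()
    }
    return sorted(totals, key=totals.get, reverse=True)


if __name__ == "__main__":
    # Made-up two-dimensional features: three similar images and one outlier,
    # standing in for a result whose text metadata was misleading.
    candidates = {
        "img_a": [1.0, 0.0],
        "img_b": [0.9, 0.1],
        "img_c": [0.8, 0.2],
        "outlier": [0.0, 1.0],
    }
    print(rank_by_mutual_similarity(candidates))
```

In this toy data set the outlier ends up last, because it has little in common with the other three candidates.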
Norvig believes that lack of resources makes it hard for newcomers to analyse mountains of data. He said that companies like Google and Yahoo have enough data records for comparative analyses, but startups don't, which is why Google looked into making its data publicly available. "We wanted to publish part of the web for use as comparative data", Norvig stated, but Google's lawyers put an end to the project for fear of copyright violations. "Apparently, it's okay to provide individual websites from the cache, but not to burn them and send them out as CDs."