In the past 12 months, Google doubled its staff, tinkered with its search engine to speed up results, and now answers more queries than Microsoft and Yahoo combined. But there’s one query we had to answer ourselves: How does Google work?
Blame spell-check. Ten years ago this September, so the story goes, some Stanford grad students were helping Larry Page choose a name for his search engine. “Googolplex,” said Sean Anderson. (It means 10^(10^100), or 10 to the power of a googol—they’d already sensed how big this could become.) “Googol,” Page replied. Anderson, checking to see if the name was taken, typed g-o-o-g-l-e into his browser and made the most famous spelling mistake since p-o-t-a-t-o-e. Page registered the name within hours, and today, Google isn’t a typo, it’s a verb, one with a market cap of about $160 billion. Here, then, is a guide to what happens during a typical Google search—now, of course, with automatic spell-check.
1. Query Box
It all starts with somebody typing in a request for information about the safest dog food, what time the D.M.V. closes, or what the prime rate is in China.
2. Domain-Name Servers
“Hello, this is your operator . . . ”
The software for Google’s domain-name servers runs on computers in leased or company-owned data centers all over the world, including one in the old Port Authority headquarters in Manhattan. Their sole purpose is to shepherd searches into one of Google’s clusters as efficiently as possible, taking into account which clusters are nearest to the searcher and which are least busy at that instant.
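The routing decision described above can be sketched in a few lines. This is a hypothetical illustration, not Google's actual logic: the cluster names, latencies, load figures, and the equal weighting of distance and load are all invented.

```python
# Toy version of the routing decision: send a search to the cluster that
# is both close to the user and lightly loaded. All values are invented.

def pick_cluster(load, latency_ms):
    """Return the cluster with the best blend of proximity and spare capacity.

    load: dict mapping cluster name -> current load (0.0 idle .. 1.0 full)
    latency_ms: dict mapping cluster name -> network latency to the user
    """
    def cost(name):
        # Penalize latency, inflated by how busy the cluster is.
        # A real system would tune this weighting carefully.
        return latency_ms[name] * (1.0 + load[name])
    return min(load, key=cost)

load = {"nyc": 0.9, "oregon": 0.2, "dublin": 0.5}
latency = {"nyc": 10, "oregon": 70, "dublin": 120}
best = pick_cluster(load, latency)   # nyc wins: busy, but far closer
```

Even at 90 percent load, the nearby cluster beats the idle but distant ones under this particular weighting; change the formula and the answer changes, which is exactly the tuning problem the real routing software solves.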
3. The Cluster
The request continues into one of at least 200 clusters, which sit in Google-owned data centers worldwide.
4. Google Web Server
This program splits a query among hundreds or thousands of machines so that they can all work on it at the same time. It’s the difference between doing your grocery shopping all by yourself and having 100 people simultaneously find one item and toss it into your cart.
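The grocery-shopping analogy is a scatter-gather pattern, sketched here in miniature. The shard contents, scores, and query strings are invented for illustration; the real system fans out to thousands of machines, not three dictionaries.

```python
# Scatter-gather: fan one query out to many index shards in parallel,
# then merge and rank the partial results. Shard data is invented.
from concurrent.futures import ThreadPoolExecutor

SHARDS = [
    {"dog food": [("safepets.example", 0.9)]},
    {"dog food": [("vetadvice.example", 0.7)], "prime rate": [("bank.example", 0.8)]},
    {"prime rate": [("centralbank.example", 0.95)]},
]

def search_shard(shard, query):
    # Each shard only knows its own slice of the index.
    return shard.get(query, [])

def scatter_gather(query):
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(search_shard, SHARDS, [query] * len(SHARDS))
    # Merge every shard's hits and rank by score, best first.
    merged = [hit for part in partials for hit in part]
    return sorted(merged, key=lambda h: h[1], reverse=True)

results = scatter_gather("dog food")
```

The total latency is roughly that of the slowest shard rather than the sum of all of them, which is the whole point of splitting the work.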
5. Index Server
Everything Google knows is stored in a massive database. But rather than waiting for one computer to sift through those gigabytes of data, Google has hundreds of computers scan its “card catalog” at the same time to find every relevant entry. Popular searches are cached—held in memory—for a few hours rather than run all over again. That means you, Britney.
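The "held in memory for a few hours" behaviour is a time-to-live cache. Here is a minimal sketch of that idea; the class, the three-hour TTL, and the lookup function are all illustrative, not Google's implementation.

```python
# A tiny TTL cache in front of an expensive lookup: popular queries are
# answered from memory until their entry goes stale.
import time

class TTLCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}              # query -> (results, timestamp)

    def get(self, query, compute):
        entry = self.store.get(query)
        if entry is not None and time.time() - entry[1] < self.ttl:
            return entry[0]          # fresh enough: serve from memory
        results = compute(query)     # stale or missing: recompute, remember
        self.store[query] = (results, time.time())
        return results

calls = []
def slow_lookup(q):
    calls.append(q)                  # track how often we hit the backend
    return [q.upper()]

cache = TTLCache(ttl_seconds=3 * 3600)
first = cache.get("britney", slow_lookup)
second = cache.get("britney", slow_lookup)   # served from cache, no new call
```

The second lookup never touches the backend, which is why a handful of very popular queries cost Google almost nothing to answer.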
6. Document Server
After the index server compiles its results, the document server pulls all the relevant documents—the links and snippets of text from its massive database. How does Google search the Web so quickly? It doesn’t. It keeps three copies of all the information from the internet that it has indexed in its own document servers, and all those data have already been prepped and sorted.
7. Spelling Server
Google doesn’t read words; it looks for patterns of characters, be they in English or Sanskrit. If it sees your requested pattern a thousand times but finds a million hits for a similar pattern that’s off by one character, it connects the dots and politely suggests what you probably meant, even while it provides you the results, if any, for your fat-fingered query for “hwedge funds.”
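That "connect the dots" heuristic can be sketched as follows: if a query is rare but a one-character variant is vastly more common, suggest the variant. The frequency table and the 100× threshold are invented; a real spelling server learns frequencies from query logs at enormous scale.

```python
# Suggest a correction when a one-edit variant of the query is far more
# frequent. HITS is an invented stand-in for learned query frequencies.

HITS = {"hedge funds": 1_000_000, "hwedge funds": 1_000}

def one_edit_apart(a, b):
    """True if b can be made from a by one insertion, deletion, or substitution."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) != len(b):
        # Try deleting each character of the longer string.
        longer, shorter = (a, b) if len(a) > len(b) else (b, a)
        return any(longer[:i] + longer[i + 1:] == shorter
                   for i in range(len(longer)))
    # Same length: exactly one differing position.
    return sum(x != y for x, y in zip(a, b)) == 1

def suggest(query):
    count = HITS.get(query, 0)
    candidates = [q for q in HITS
                  if one_edit_apart(query, q) and HITS[q] > 100 * count]
    return max(candidates, key=HITS.get) if candidates else None

hint = suggest("hwedge funds")   # the fat-fingered query from the text
```

Note that `suggest` still returns the original results path untouched: the correction is only a polite hint alongside whatever hits the misspelling itself produced, just as the article describes.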
8. Ad Server
Each query is simultaneously run through an ad database, and matches are fed to the Web server so that they’re placed on the results page. The ad team is in a race with the search team. Google vows to deliver all searches as quickly as possible; if ad results take longer to pull up than search results, they won’t make it onto the page—and Google won’t make money on that search.
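The race between the ad team and the search team amounts to a deadline: ads are fetched concurrently and simply dropped if they miss it. A sketch, with invented timings and a made-up 50 ms deadline:

```python
# Fetch ads concurrently with the search; if they miss the deadline,
# ship the page without them. Delays and deadline are invented.
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def fetch_ads(query, delay):
    time.sleep(delay)                          # simulate ad-server latency
    return [f"ad for {query}"]

def build_page(query, ad_delay, deadline=0.05):
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch_ads, query, ad_delay)
        results = [f"result for {query}"]      # search finishes meanwhile
        try:
            ads = future.result(timeout=deadline)
        except TimeoutError:
            ads = []                           # too slow: no ads, no revenue
    return {"results": results, "ads": ads}

fast = build_page("dog food", ad_delay=0.0)    # ads make the page
slow = build_page("dog food", ad_delay=0.2)    # ads dropped
```

Search results are never delayed by the ad lookup; the only thing at stake when the deadline passes is Google's revenue on that query.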
9. Page Builder
The Google Web server collects the results of the thousands of operations it runs for a query, organizes all the data, and draws Google’s cunningly simple results page on your browser window, all in less time than it took to read this sentence.
10. Results Displayed
Often in 0.25 seconds or less.
Cluster Control
Google’s genius lies in its networking software, which helps thousands of cheap computers in a cluster act like one huge hard drive. Those inexpensive computers allow Google to replace parts without stopping the whole show: If a computer drops dead, there are at least two others ready to take its place while an engineer swaps out the busted machine.
Power Power
Just about the only thing limiting Google’s performance is how much electricity the company can buy. One of its newest data centers (code name: Project 02) is near the Columbia River in The Dalles, Oregon, which has access to 1.8 gigawatts of cheap hydroelectric power; not coincidentally, this is where major internet hookups from Asia connect to U.S. networks. The byte factory has two computing centers, each the size of a football field.
Petabytes
Based on the few numbers Google releases, experts guess that at least 20 petabytes of data are stored on its servers. But Googleytes are famous for understatement; Wired says Google may have 200 petabytes of capacity. So how much is that? If your iPod were just 1 petabyte (one million gigabytes), you’d have about 200 million songs to shuffle. And if you started downloading a petabyte over your high-speed internet connection, your great-great-great-great-grandchild might still be around when the last few bytes get transferred, in 2514.
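The back-of-envelope numbers above can be checked directly. Two assumptions are mine, not the article's: a song is roughly 5 MB, and "high-speed" in 2007 means about a 0.5 Mbit/s DSL line (decimal units throughout, so 1 PB = 10^15 bytes).

```python
# Sanity-check the petabyte arithmetic under stated assumptions.

PETABYTE = 10**15                      # bytes (decimal units)
SONG = 5 * 10**6                       # assume ~5 MB per song
songs = PETABYTE // SONG               # songs on a 1 PB iPod

LINE_SPEED = 0.5e6 / 8                 # assume 0.5 Mbit/s, in bytes/second
seconds = PETABYTE / LINE_SPEED
years = seconds / (365.25 * 24 * 3600)
finish_year = 2007 + round(years)      # download starts in 2007
```

Under those assumptions the math lands on 200 million songs and a finish date of 2514, matching the article's figures almost exactly.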
Page Rank
Google decides how reliable a site is—and thus how important the site’s content will be when Google forms a list of search results—by considering more than 200 factors as it analyzes content. But the secret sauce is Google’s patented formula for following and scoring every link on a page to learn how different sites connect, which means a site is deemed reliable based largely on the quality of the sites that link to it.
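The core of that link-scoring formula is public: the original PageRank paper describes a damped random-surfer model, which can be run as a simple power iteration. The toy graph below is invented; the 0.85 damping factor is the paper's, and everything else is a simplified sketch of one of the 200-plus factors, not the whole ranking system.

```python
# Minimal power-iteration PageRank over a toy link graph: a page is
# important if important pages link to it.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a small "teleport" share...
        new = {p: (1 - damping) / n for p in pages}
        # ...and passes the rest of its rank along its outgoing links.
        for p, outs in links.items():
            share = rank[p] / len(outs) if outs else 0
            for q in outs:
                new[q] += damping * share
        rank = new
    return rank

# "a" is linked by both "c" and "d", so it ends up ranked highest.
graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
ranks = pagerank(graph)
```

Note how "d" earns almost nothing despite linking out: rank flows to pages that receive links, which is exactly the "quality of the sites that link to it" idea in the text.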
Googlebots
Google deploys programs called spiders to build its copies of the internet. On popular sites, Googlebots may follow every link several times an hour. As they scour the pages, the spiders save every bit of text or code. The raw data are pulled back into the cluster, run through the mill, and scheduled to incrementally replace the older data already on the index and doc servers, ensuring that results are fresh, never frozen.
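The spider's core loop is a breadth-first traversal that saves each page and never fetches the same URL twice. The in-memory "web" below is invented; a real Googlebot also respects robots.txt, schedules revisits by popularity, and runs at enormous scale.

```python
# A toy spider over an in-memory web: follow every link breadth-first,
# saving each page's text, skipping pages already seen.
from collections import deque

WEB = {
    "home.example": ("Welcome", ["news.example", "about.example"]),
    "news.example": ("Headlines", ["home.example"]),
    "about.example": ("Our story", []),
}

def crawl(seed):
    saved = {}
    queue = deque([seed])
    seen = {seed}
    while queue:
        url = queue.popleft()
        text, links = WEB[url]
        saved[url] = text                  # keep a raw copy of the page
        for link in links:
            if link not in seen:           # never fetch the same URL twice
                seen.add(link)
                queue.append(link)
    return saved

copies = crawl("home.example")
```

The `seen` set is what keeps the spider from looping forever on sites that link back to themselves, like the home/news cycle above.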
Source: www.portfolio.com
Wednesday, August 1, 2007
Google removes labeling of supplemental results
Google claims that being in the supplemental index will pose less of a problem in the future.
To simplify: Google operates with two main indexes of web search results: (1) the regular database and (2) supplemental results.
Supplemental results consist of what Google considers less important pages: pages with few or no inbound links, duplicate content, and so on.
The Valley of Death
Many webmasters regularly wake up screaming in the night, having dreamed that their whole site has gone supplemental.
Supplemental has been considered the valley of death by search engine marketers, as Google normally only presents results from supplemental if it cannot find a reasonable number of relevant hits in the regular index.
Furthermore, pages in the supplemental index are revisited less often by Google.
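The fallback behaviour described above can be sketched in a few lines. Both miniature indexes, the query strings, and the two-hit threshold are invented for illustration; they stand in for whatever internal cutoff Google actually uses.

```python
# Serve from the regular index; dip into the supplemental index only
# when the regular index returns too few hits. All data is invented.

REGULAR = {"hedge funds": ["site-a", "site-b", "site-c"]}
SUPPLEMENTAL = {"hedge funds": ["scraper-x"], "obscure widget": ["tiny-blog"]}

def search(query, min_hits=2):
    hits = REGULAR.get(query, [])
    if len(hits) < min_hits:                       # not enough regular results
        hits = hits + SUPPLEMENTAL.get(query, [])  # fall back to supplemental
    return hits

common = search("hedge funds")     # regular index suffices; no supplemental
rare = search("obscure widget")    # only supplemental has anything
```

This is why going supplemental feels like a death sentence to webmasters: for any query the regular index can satisfy on its own, supplemental pages are never even consulted.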
Well, from now on webmasters may sleep a little sounder, not because supplemental will go away, but because Google is removing the label that identifies supplemental search results.
You can still search for supplemental
The company has also removed the most used way of finding supplemental results by searching Google, but as Danny Sullivan over at Search Engine Land points out, the following technique is still working:
Do a search for: site:yourdomainnamehere.com/&
This option will probably be removed soon.
The difference between the main and supplemental index is narrowing
The supplemental index remains. In a post on the Google Webmaster Central blog, Matt Cutts and his colleagues say:
Since 2006, we’ve completely overhauled the system that crawls and indexes supplemental results. The current system provides deeper and more continuous indexing.
Additionally, we are indexing URLs with more parameters and are continuing to place fewer restrictions on the sites we crawl. As a result, Supplemental Results are fresher and more comprehensive than ever.
We’re also working towards showing more Supplemental Results by ensuring that every query is able to search the supplemental index, and expect to roll this out over the course of the summer.
Supplemental results will apparently turn up more frequently in search results, and the results will be fresher.
Is this really a problem?
It helps to keep in mind that Google needs an algorithm that makes it possible to sort out the best and most relevant search results for a query. This is, for instance, what PageRank is all about.
Even if Google decided to merge the two databases, there would be an enormous number of pages that would never or seldom turn up on the first 10 pages of search results. This applies to most queries.
Pages with no inbound links, little content, or from web sites with very little authority (spammy scraper sites, for instance) will normally only turn up when people do very specialized searches, hitting the “long tail” of search. This is the way it should be.
This Valley of Death only becomes a problem if Google starts to fill it with well-written, informative, and original articles and blog posts, while at the same time serving junk pages on the first page of results.
This has been known to happen, and by labeling the valley as “supplemental results,” Google made it easier for webmasters to call it out when such mistakes occur.
So, if you are cynical, you may say that Google is now trying to hide its mistakes.
We do not think this is the case. Even without the label, it is easy to document cases of high-quality pages being buried and junk floating to the top. All you have to do is analyze search results for selected queries.
Google may introduce a search tool for supplemental pages
According to Danny Sullivan, Matt Cutts of Google feels that a search syntax for identifying supplemental results causes site owners to fixate needlessly on such results, in the same way they often obsess over PageRank:
“Still, he said that Google would probably come up with a way for people to perform a query within Google Webmaster Central or some other method to find out if a page or pages are in the supplemental index.”
Sullivan asks for a tool within Google Webmaster Central that provides a list of your own supplemental pages, or the percentage of a site’s pages deemed supplemental, as a health check.
But why stop at supplemental? Why not introduce a search tool that helps webmasters identify different types of “unhealthy pages”? Webmaster Central could then give a diagnosis of individual pages or sets of pages, listing likely causes of poor rankings, for instance:
* Lack of backlinks
* Lack of content
* Duplicate content
* Robots.txt and metatag problems
* Server downtime
* Bad coding
* etc.
We realize that there are limits to how far Google can go in this direction, as any input of this kind will be used by webmasters trying to reverse-engineer the Google algorithm. But in general, Google benefits when webmasters clean up their acts, and helping them understand what’s wrong with their pages will contribute to this.
Indeed, Google Webmaster Central already provides information on:
* HTTP errors
* Not found
* URLs not followed
* URLs restricted by robots.txt
* URLs timed out
* Unreachable URLs
* Links
Extending this would mostly be a matter of organizing the information Google already has by individual page.
Source: www.pandia.com