refresh/expire, directory/search engines, spiders n bots [Was: Re: Page redirect]

by "Abhay S. Kushwaha" <abhay(at)kushwaha.com>
Date:	Sat, 9 Dec 2000 21:35:55 +0530
To:	"Basics [HWG]" <hwg-basics(at)hwg.org>
References:	sportsstuff gte
Lori, I think you're confusing the "refresh" with "expire".

See, what P. Wilson is pointing out is the use of the first one. You
must have come across pages that load and then, a few seconds later,
leave you looking at a different URL altogether. The typical use is for
pages that have moved: the original location tells you that the page
has moved to such-and-such a new location, and after 5-10 seconds the
browser "automatically" takes you to the new page. It is *there* that
the refresh tag is being used. It says "refresh this page after this
much time has elapsed, and use the content from this location".
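
To make that concrete, here is a rough sketch in Python of how a
program might read the delay and the new location out of a refresh tag.
The sample page, the example.com URL and the names RefreshFinder and
find_refresh are my own placeholders, not anything from the thread, and
real browsers are more forgiving about malformed values than this is.

```python
from html.parser import HTMLParser

# Hypothetical "this page has moved" document; the URL is a placeholder.
MOVED_PAGE = """<html>
<head>
<title>Page moved</title>
<meta http-equiv="refresh" content="5; url=http://www.example.com/new-page.html">
</head>
<body>
<p>This page has moved. You will be taken to the new location in 5 seconds.</p>
</body>
</html>"""


class RefreshFinder(HTMLParser):
    """Collects the delay and target URL of the first usable meta refresh tag."""

    def __init__(self):
        super().__init__()
        self.delay = None   # seconds to wait before reloading
        self.target = None  # new URL, or None to reload the same page

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.delay is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute looks like "5; url=http://..."
        delay_part, _, rest = (attrs.get("content") or "").partition(";")
        try:
            self.delay = int(delay_part.strip())
        except ValueError:
            return  # malformed delay: ignore this tag
        rest = rest.strip()
        if rest.lower().startswith("url="):
            self.target = rest[4:].strip()


def find_refresh(html):
    """Return (delay_in_seconds, target_url) or (None, None) if absent."""
    finder = RefreshFinder()
    finder.feed(html)
    finder.close()
    return finder.delay, finder.target


print(find_refresh(MOVED_PAGE))
# prints (5, 'http://www.example.com/new-page.html')
```

A refresh tag with only a number in "content" (no url= part) just
reloads the same page after that many seconds, which is why the target
can come back as None.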

The "expire", however, tells the "caching system" that the page is
valid only until such-and-such a date; once that date has passed, the
cache must make sure that the browser retrieves a fresh copy from the
server. Here the URL does not change; the browser simply fetches the
page afresh from the HTTP server.
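
The decision a cache makes here can be sketched roughly like this in
Python. It assumes the expiry date arrives in the usual HTTP date
format (as in an "Expires:" header); the function name is_fresh and
the sample dates are made up for illustration, and a real cache looks
at more than this one value.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_fresh(expires_value, now):
    """Return True if a cached copy may still be used at time `now`."""
    try:
        expires_at = parsedate_to_datetime(expires_value)
    except (TypeError, ValueError):
        # An unparseable value such as "0" is treated as already expired.
        return False
    if expires_at.tzinfo is None:
        # No usable timezone in the value: assume GMT/UTC.
        expires_at = expires_at.replace(tzinfo=timezone.utc)
    return now < expires_at


expires = "Sun, 10 Dec 2000 16:05:55 GMT"  # made-up expiry date

day_before = datetime(2000, 12, 9, 16, 5, 55, tzinfo=timezone.utc)
day_after = datetime(2000, 12, 11, 16, 5, 55, tzinfo=timezone.utc)

print(is_fresh(expires, day_before))  # prints True: cached copy is OK
print(is_fresh(expires, day_after))   # prints False: fetch a fresh copy
```

Notice that nothing here changes the URL; the only question is whether
the stored copy is still good or has to be fetched again.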

You are also confusing search engines with directories. Yahoo! [1] and
DMOZ [2] are basically directories, edited by humans. When you submit
your URL to be added to these sites, a human editor visits your site,
looks at it, and then adds it to the appropriate category.
However, Google [3] and Altavista [4] are examples of search engines.
They search a database of indexed pages for the string you enter in
their query boxes and, based on their "proprietary" algorithms, bring
up the associated URLs. The search engines use a "spider" or "bot": a
program that loads your pages, follows the links so that NO linked page
on your site is left unloaded when it's done, and indexes them, adding
them to the engine's huge database.
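
If it helps, the spider idea can be sketched in a few lines of Python.
This is only a toy under my own assumptions: the three-page "site" is a
made-up dictionary held in memory instead of being fetched over HTTP,
and PageScanner and crawl are hypothetical names. Real spiders are far
more elaborate, but the shape is the same: load a page, note its
words, queue up its links, repeat until nothing is left.

```python
from collections import deque
from html.parser import HTMLParser

# Made-up three-page site; a real spider would fetch these over HTTP.
SITE = {
    "/index.html": '<p>Welcome to our sports shop</p>'
                   '<a href="/shoes.html">Shoes</a> '
                   '<a href="/about.html">About</a>',
    "/shoes.html": '<p>Running shoes and football boots</p>'
                   '<a href="/index.html">Home</a>',
    "/about.html": '<p>About our shop</p>',
}


class PageScanner(HTMLParser):
    """Collects the links and the words found on one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.words.extend(word.lower() for word in data.split())


def crawl(site, start):
    """Visit every page reachable from `start`; return word -> set of URLs."""
    index = {}
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        scanner = PageScanner()
        scanner.feed(site[url])
        scanner.close()
        for word in scanner.words:
            index.setdefault(word, set()).add(url)
        for link in scanner.links:
            # Follow only links inside the site that we have not seen yet.
            if link in site and link not in seen:
                seen.add(link)
                queue.append(link)
    return index


index = crawl(SITE, "/index.html")
print(sorted(index["shoes"]))
# prints ['/index.html', '/shoes.html']
```

A search for "shoes" would then simply look the word up in that index,
which is roughly what happens when you type into the query box.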

To allay confusion: when you use the Yahoo! search, the searching
"program" first searches the Yahoo! directory itself (in-site search,
something like what Atomz [5] provides to webmasters) and then
searches its web-page database -- Yahoo! is using Google's [3] search
engine now; earlier it was Inktomi's [6]. So, when you first see the
"category" results, it is Yahoo!'s own listing being displayed -- but
when you see the "web pages" results, it is Google's database in use.

Back to the "refresh" tag.

Now, "bad" people can use the "refresh" tag to chain together a series
of documents stuffed with keywords. The "spider" or "bot" wouldn't know
the difference and would, as it is programmed to do, index them. Then,
in real life, a user comes along and searches the engine for a word
that has been included in these pages. Because multiple pages under the
same base URL contain the desired keyword, these pages would be listed
at the top, and the searcher will be tempted to click on them. Whether
or not the info is actually there is a different question; the point is
that the logic of the search engine has been compromised by "cheating".
Hence, the bots have "rules". The most common rule, which *every*
spider/bot has these days, is to ban sites with "invisible text" (that
is, text in the same colour as the background). The heavy-duty search
engine spiders/bots are slowly also incorporating rules against this
"refresh" cheating.
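
To show what an "invisible text" rule might look like, here is a small
Python sketch. I don't know how any particular search engine actually
implements it, so take this purely as an illustration: it only compares
the bgcolor of the body tag with the color of font tags as literal
strings, ignores style sheets and colour names, and the names
HiddenTextFinder and find_hidden_text are my own.

```python
from html.parser import HTMLParser

# Made-up page: white background with keyword text in a white font.
PAGE = """<html><body bgcolor="#FFFFFF">
<p>Visible welcome text.</p>
<font color="#ffffff">cheap shoes cheap shoes cheap shoes</font>
</body></html>"""


class HiddenTextFinder(HTMLParser):
    """Flags text whose font colour equals the page background colour."""

    def __init__(self):
        super().__init__()
        self.background = None  # value of <body bgcolor="...">
        self.font_colors = []   # colours of the currently open <font> tags
        self.hidden = []        # text judged to be invisible

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "body":
            self.background = (attrs.get("bgcolor") or "").lower() or None
        elif tag == "font":
            self.font_colors.append((attrs.get("color") or "").lower() or None)

    def handle_endtag(self, tag):
        if tag == "font" and self.font_colors:
            self.font_colors.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or self.background is None:
            return
        # The innermost <font> tag that sets a colour decides the text colour.
        current = next((c for c in reversed(self.font_colors) if c), None)
        if current == self.background:
            self.hidden.append(text)


def find_hidden_text(html):
    """Return the list of text runs that match the background colour."""
    finder = HiddenTextFinder()
    finder.feed(html)
    finder.close()
    return finder.hidden


print(find_hidden_text(PAGE))
# prints ['cheap shoes cheap shoes cheap shoes']
```

A bot with a rule like this could refuse to index the page, or the
whole site, as soon as that list comes back non-empty.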

To give you an idea, take a look at Altavista's Submission Policies
[7] when you are trying to submit a URL for it to index. It says that
they will exclude, among others:
 . Machine-generated pages with minimal or no content, whose sole
   purpose is to get a user to click to another page,
 . Pages that contain only links to other pages, or
 . Pages whose primary intent is to redirect users to another page.
Get the idea? Normally the "refresh" pages would be grouped under the
last point, but I think JS-generated "refresh" pages should be covered
by the first point in the list above.

The directory editors, however, since they are human, will evaluate
this "refresh" thing in human terms. Human editors don't look at the
code when they add your site to the directory listing -- all those
"meta tags" that you lovingly put in there are for the search engine
spiders/bots and are useless for the human editors. They add your site
if they think that the site's content justifies its place in the
category it was submitted to.

Hope this explains it.

[abhay]

[1] http://www.yahoo.com
[2] http://www.dmoz.org
[3] http://www.google.com
[4] http://www.altavista.com
[5] http://www.atomz.com
[6] http://www.inktomi.com
[7] http://www.altavista.com/cgi-bin/query?pg=addurl#form

----- Original Message -----
From: "Lori Eldridge" <lorield(at)uswest.net>
Sent: Saturday, December 09, 2000 7:27 PM


> > It is not recommended to use the meta refresh tag or other
> > common forms of refresh.  Several of the Search Engines ban
> > websites that use refresh.  They think some splash pages
> > are used to trick their indexing spider and allow a higher
> > SE rating.
> >
> >Paul Wilson
>
> I have the meta refresh tag on almost all of my web pages so
> people will get the updated copy and not one in their cache.
> I had never heard that some search engines won't list a site
> with them. Can you tell us which ones? Must not be Yahoo
> because they have listed all but one of my sites that I
> requested.
>
> Is there another way to make sure viewers get the most up date
> page on a web site?
HTML: hwg-basics mailing list archives, maintained by Webmasters @ IWA