TCS - Using Search Engines on the Internet

Using Search Engines on the Internet

by Don Singleton
Tulsa Computer Society
From the March 1997 issue of the I/O Port Newsletter

Searching For Information on the Internet was the topic for the February 8 Internet Special Interest Group Meeting. Although these notes will not appear in print until a few days after that meeting, they may be considered the handouts for that meeting, in that they discuss the same topic that was demonstrated there.

There are a number of search engines listed on the TCS web page. Some are specialized, searching for businesses, people, etc., but most are general-purpose, and it is those that we will focus on here.

There are really two categories of tools that people can use to search for information on the Internet: Web Directories and true Search Engines. Yahoo (http://www.yahoo.com) is the best known Web Directory; it is a subject-tree style catalog that organizes web pages into 14 major topics, each with subtopics, and each of those having additional subtopics, as one moves from the more general to the more specific. When a person wants his web page listed by Yahoo, he can look at this "outline-structure" of the world and pick the two places in that structure where he feels his page most "belongs", and he provides a limited set of search terms that he feels people searching for his page would be likely to choose. Thus we have human intelligence selecting the search terms and the position in the catalog. A person searching with Yahoo will probably not have to wade through a number of pages that really have nothing to do with what he is looking for, but he will be less likely to find pages that just touch on his subject. If there are a lot of pages that specialize in the topic he is interested in, that is fine, but if no pages specialize in the topic, then the Yahoo searcher may be left with "no pages found", even though there are pages out there that touch on the subject.
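The trade-off described above can be sketched in a few lines of Python. This is only an illustration: the `Listing` class, the category names, and the example.com URLs are all made up, and nothing here reflects how Yahoo actually stores its catalog.

```python
from dataclasses import dataclass

@dataclass
class Listing:
    url: str
    categories: list   # places in the subject tree picked by the page's author
    keywords: set      # search terms supplied by the page's author
    text: str = ""     # full page text, which a directory never reads

# Hypothetical listings; the category names and URLs are invented.
LISTINGS = [
    Listing("http://example.com/cardiology",
            ["Health/Medicine/Cardiology", "Science/Biology/Anatomy"],
            {"heart", "cardiology"},
            "a guide to the heart and its diseases"),
    Listing("http://example.com/marathon",
            ["Recreation/Sports/Running", "Health/Fitness"],
            {"marathon", "running"},
            "training tips and how running strengthens the heart"),
]

def directory_search(term):
    """Match only the keywords and category names that people chose."""
    term = term.lower()
    return [entry.url for entry in LISTINGS
            if term in entry.keywords
            or any(term in c.lower().split("/") for c in entry.categories)]

def browse(branch):
    """List every page filed under one branch of the subject tree."""
    return [entry.url for entry in LISTINGS
            if any(c == branch or c.startswith(branch + "/")
                   for c in entry.categories)]

print(directory_search("heart"))
print(browse("Health"))
```

Searching for "heart" returns only the cardiology page, because the marathon page merely touches on the subject in its text and its author did not list "heart" as a search term.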

A true search engine uses software programs known as robots, spiders, worms, or crawlers. These programs either are asked by a page's author to index the page, or find the page on their own as they follow hyperlinks from one document to another around the web. Once a search engine finds a page, it reads all of the information on the page, selects the words that it judges to be important, and uses them to construct its index. Some search engines ignore commonly occurring stop words such as "a", "an", "the", "is", and "and" (I would not suggest using one of them to search for the phrase "To be or not to be"), while others index every word, including the stop words. Words that are mentioned toward the top of a document and words that are repeated several times throughout the document are more likely to be deemed important, as are words that are used in titles, headings, subheadings, etc. Some web page authors try to take advantage of that fact; I have run across adult-oriented pages that repeat the word sex over and over and over again, because the author is hoping that will cause the search engines to list the page first when someone does a search on that term. I would not have thought that many people would be using the search engines to try to locate adult-oriented pages, when there are pages like www.persiankitty.com that seem to list so many of them, but I once ran across a page that listed the words most often used as search terms, and I was surprised to find that more than half of the words on that list appeared to be looking for naughty pages. I hope those people found what they were searching for; I know that I have found the search engines very helpful in searching for the things I have looked for (mostly things that would not have upset self-appointed web censors like State Rep Fred Perry and Senator Exon).
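The indexing ideas above (skipping stop words, and counting words more heavily when they are repeated, near the top, or in the title) can be sketched roughly as follows. The stop-word list, the weights of 3, 2, and 1, and the `top_n` cutoff are all assumptions chosen for illustration; each real search engine uses its own undisclosed rules.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; real engines each choose their own.
STOP_WORDS = {"a", "an", "the", "is", "and", "to", "be", "or", "not"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def score_words(title, body, top_n=20):
    """Weight each non-stop word: repeats add up, and words in the title
    or among the first top_n body words count extra (assumed weights)."""
    scores = defaultdict(float)
    for word in tokenize(title):
        if word not in STOP_WORDS:
            scores[word] += 3.0
    for position, word in enumerate(tokenize(body)):
        if word in STOP_WORDS:
            continue
        scores[word] += 2.0 if position < top_n else 1.0
    return dict(scores)

def build_index(pages):
    """Map each word to {url: weight} for every page that contains it."""
    index = defaultdict(dict)
    for url, (title, body) in pages.items():
        for word, weight in score_words(title, body).items():
            index[word][url] = weight
    return index

def search(index, query):
    """Return URLs ranked by the summed weight of the query words."""
    totals = defaultdict(float)
    for word in tokenize(query):
        for url, weight in index.get(word, {}).items():
            totals[url] += weight
    return sorted(totals, key=lambda url: (-totals[url], url))

pages = {
    "page1": ("Heart Health", "The heart is a pump. A healthy heart moves blood."),
    "page2": ("Valentine Gifts", "Flowers and candy win a heart."),
}
index = build_index(pages)
print(search(index, "heart"))
print(search(index, "to be or not to be"))
```

The second search comes back empty, which is the "To be or not to be" problem: every word in the query is a stop word, so none of them were ever indexed. The weighting also shows why repeating a word many times on a page can push that page up the list.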

Most search engines basically use keyword indexing: if the exact word, or perhaps a variant like the plural or the past tense, is used on a page, you will find it, but if you did your search on heart and the page used the term cardiac, you would not find it. This means that you need to think about other words that might be used to describe what you are looking for. Some search engines attempt to use concept-based indexing. If such an engine found heart used in the same pages as words like coronary, artery, lung, stroke, cholesterol, pump, blood, attack, and arteriosclerosis, it would associate heart with the medical/health context, and would be likely to pull pages that referenced one of the related concept terms. If it found heart used along with words like flowers, candy, love, passion, and valentine, it would associate heart with the context of romance, and would be likely to pull pages associated with that context. Concept-based indexing is a good idea, but it frequently falls down in practice, since the concept associations are made by a computer program and not by a human being. With those search engines it is even more important to include several terms, so that the search engine will know which subject one is looking for.
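A very rough sketch of the concept-based idea, using the word lists from the paragraph above as hand-built concept vocabularies. This is an assumption for illustration only: a real concept-based engine derives its associations statistically from the pages it reads, and the `CONCEPTS` table and function names here are invented.

```python
import re

# Hypothetical concept vocabularies, taken from the examples in the text.
CONCEPTS = {
    "medical": {"coronary", "artery", "lung", "stroke", "cholesterol",
                "pump", "blood", "attack", "arteriosclerosis", "cardiac"},
    "romance": {"flowers", "candy", "love", "passion", "valentine"},
}

def words_in(text):
    """Return the set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def guess_concept(text):
    """Return the concept whose vocabulary overlaps the text most, or None."""
    words = words_in(text)
    overlaps = {name: len(words & vocab) for name, vocab in CONCEPTS.items()}
    best = max(overlaps, key=overlaps.get)
    return best if overlaps[best] > 0 else None

def concept_search(query, pages):
    """Return pages whose guessed concept matches the query's concept,
    even if they never use the query's exact words."""
    target = guess_concept(query)
    if target is None:
        return []
    return sorted(url for url, text in pages.items()
                  if guess_concept(text) == target)

pages = {
    "clinic": "Cardiac care: coronary artery disease and stroke prevention.",
    "gifts": "Send flowers and candy to your valentine.",
}
print(concept_search("heart attack blood", pages))
print(concept_search("heart", pages))
```

The first query finds the clinic page even though that page never uses the word heart. The second query finds nothing, because the single word heart gives this sketch no context to work from, which is why several terms help.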

Yahoo

Yahoo (http://www.yahoo.com) is a web directory, or hierarchical subject index, rather than a search engine, but it is the place I look first if I know the name of a company, or product, or major topic, and want to find pages that focus primarily on that topic. If Yahoo cannot find a term, it can automatically send you to its partner AltaVista for a search through their index of keywords. In their scoring of search engines The Spider's Apprentice gave them an A-. You can use Boolean AND and OR operators, and Yahoo is case insensitive.

AltaVista

AltaVista (http://www.altavista.digital.com) is a powerful keyword search engine that allows you to search on phrases of more than one word by putting them inside quotation marks. AltaVista is my second favorite search tool and in their scoring of search engines The Spider's Apprentice gave them an A. It allows search refining using the Boolean operators "AND", "OR", and "NOT", plus the proximity operator "NEAR". It allows wildcards and supports "backwards" searching, where you can find all of the other web sites that link to a page.
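To make the phrase and NEAR operators concrete, here is a small Python sketch of what each test means when applied to the words of a single page. The function names and the default window of 10 words are assumptions for illustration, not a description of AltaVista's internals.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into a list of words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def has_phrase(words, phrase):
    """True if the words of the phrase occur consecutively, in order."""
    target = tokenize(phrase)
    n = len(target)
    return n > 0 and any(words[i:i + n] == target
                         for i in range(len(words) - n + 1))

def near(words, first, second, window=10):
    """True if first and second occur within window words of each other."""
    pos_a = [i for i, w in enumerate(words) if w == first]
    pos_b = [i for i, w in enumerate(words) if w == second]
    return any(abs(a - b) <= window for a in pos_a for b in pos_b)

doc = tokenize("The Tulsa Computer Society meets monthly to discuss the Internet.")

# AND, OR, and NOT map directly onto Python's own boolean operators.
print("tulsa" in doc and "internet" in doc)
print("tulsa" in doc and "monthly" not in doc)
print(has_phrase(doc, "computer society"))
print(near(doc, "tulsa", "internet", window=3))
```

A phrase search is stricter than AND, since the words must be adjacent and in order, while NEAR sits in between: both words must appear, and close together, but in either order.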

InfoSeek

InfoSeek (http://guide.infoseek.com) now uses a new search technology called Ultraseek, so it may soon be known by that name instead of InfoSeek. In their scoring of search engines The Spider's Apprentice gave them an A. It does not allow Boolean operators, but uses + and -, which are similar to AND and NOT.
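The + and - prefixes can be read as "must appear" and "must not appear". A minimal sketch of that reading in Python follows; the function names are invented, and the rule for terms with no prefix (at least one must appear when nothing is required) is an assumption, not documented InfoSeek behavior.

```python
def parse_query(query):
    """Split a query into required (+), excluded (-), and optional terms."""
    required, excluded, optional = set(), set(), set()
    for term in query.lower().split():
        if term.startswith("+") and len(term) > 1:
            required.add(term[1:])
        elif term.startswith("-") and len(term) > 1:
            excluded.add(term[1:])
        else:
            optional.add(term)
    return required, excluded, optional

def matches(page_text, query):
    """True if the page satisfies the +/- constraints of the query."""
    words = set(page_text.lower().split())
    required, excluded, optional = parse_query(query)
    if not required <= words:      # every + term must be present
        return False
    if excluded & words:           # no - term may be present
        return False
    # Assumed rule: with no + terms, at least one plain term must appear.
    return bool(required) or bool(optional & words)

print(matches("heart surgery recovery", "+heart -candy"))
print(matches("heart shaped candy", "+heart -candy"))
```

The first page matches and the second does not, which is the same result one would get from the Boolean query heart AND NOT candy.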

Excite

Excite (http://www.excite.com) is a concept-based search engine combined with a Web-Directory listing (like Yahoo) and in their scoring of search engines The Spider's Apprentice gave them a B+. It uses a fuzzy AND, which searches on both AND and OR but gives preference to AND, and it has recently added the Boolean operators AND, OR, and AND NOT, and the characters + and -. Excite has replaced WebCrawler as AOL's primary search engine.
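One plausible reading of a "fuzzy AND" is: accept any page that matches at least one term (the OR part), but list pages that match more of the terms first (the AND preference). The sketch below implements that reading only; it is an assumption, since the text does not say how Excite actually ranks its results, and the function name is invented.

```python
def fuzzy_and(query, pages):
    """Rank pages that match any term, with pages matching more terms first."""
    terms = set(query.lower().split())
    ranked = []
    for url, text in pages.items():
        hits = len(terms & set(text.lower().split()))
        if hits:                   # OR: one matching term is enough to qualify
            ranked.append((hits, url))
    # AND preference: pages matching every term have the highest hit count.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in ranked]

pages = {
    "a": "tulsa weather report",
    "b": "tulsa computer society newsletter",
    "c": "gardening tips",
}
print(fuzzy_and("tulsa computer", pages))
```

Page b comes first because it contains both terms, page a follows because it contains one, and page c is left out entirely.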

Lycos

Lycos (http://www.lycos.com) is a large keyword search engine that is gradually adding Web-Directory services (like Yahoo). It is harder to refine a search using Lycos, but they do have a link to Roadmap (http://www.lycos.com/roadmap.html) which can generate maps based on addresses, which is nice. In their scoring of search engines The Spider's Apprentice gave them a B. In Search Refining one can ask for "Any", "Or", or "All" terms to be matched, and they allow one to specify "loose match", "fair match", "good match", "strong match", or "close match", although it is unclear exactly what each means. It does not allow complex Boolean operators, and one cannot search on phrases or on capitalization. One reason that I use other search engines before I use Lycos is that they insist on downloading a lot of fancy graphics before they bring up the box to do the searching, and if the net is running slow for one reason or another, it can seem to take ages for Lycos to be ready to let you tell it what you are looking for.

Magellan

In addition to being a search engine, Magellan (http://www.mckinley.com) reviews and rates a number of the sites it references, and it assigns a Green Light to sites that do not have any adult-oriented material.

HotBot

HotBot (http://www.hotbot.com), which is built on the Inktomi search technology, is an interesting search engine, and in their scoring of search engines The Spider's Apprentice gave them a B.

Open Text

Open Text (http://index.opentext.net) is not as extensive as some of the other search engines, and they have recently begun charging companies for "preferred listings". A preferred listing causes a company's pages to come up early in a list of hits, regardless of whether the pages deserve that placement by being really appropriate to the search criteria specified by the user. Because of that I rank them lower among the search engines I use. In their scoring of search engines The Spider's Apprentice gave them a B.

WebCrawler

WebCrawler (http://webcrawler.com) used to be AOL's primary search engine, but AOL has just replaced it with Excite, so I don't know how much longer WebCrawler will be around. In their scoring of search engines The Spider's Apprentice gave them a B-. They have full Boolean search capability, including AND, OR, AND NOT, ADJ (adjacent) and NEAR.






Tulsa Computer Society 02/07/97
Don Singleton, President
tcs@galstar.com