Webindexing.biz

Quality indexing solutions

Website Indexing 2Ed Glossary

Appendix 4: Glossary

Words in bold indicate that there are separate glossary entries on those topics. This glossary is taken from the second edition of Website Indexing, by Glenda Browne and Jonathan Jermey.

Absolute addressing: the practice of including a complete URL in the anchor tag for a link to a webpage. Links to other websites are always absolute but links within a website may be absolute or relative.

Adobe Acrobat PDF, see PDF

Agent, see Bot

Anchor (Bookmark): an HTML anchor makes the location in the file at which it is inserted available as a target for a link. It is written in the format <a name="SectionName" title="SectionName">.

Automated categorisation: the use of computer software to categorise webpages. It can be done using rule-based methods, in which the system is gradually trained, or by fully automated methods. Taxonomies for categorisation can also be created automatically.

Back-of-book-style indexing: creation of a website index that looks and functions like a back-of-book index. It will usually be alphabetically organised, give detailed access to information, and contain index entries with subdivisions and cross-references.

Bookmark, see Anchor

Boolean ‘and’: Use of the Boolean operator ‘and’ in a query means that all of the terms in the query must be present in a document for it to be retrieved. For example, ‘automated and categorisation’ means that a document must contain the term ‘automated’ and the term ‘categorisation’.

Boolean ‘not’: Use of the Boolean operator ‘not’ in a query means that if the search term is present in a document, that document will not be retrieved. For example, ‘bear not market’ will not retrieve a document with the sentence ‘Share prices have gone down in the bear market’.

Boolean ‘or’: Use of the Boolean operator ‘or’ in a query means that at least one of the terms in the query must be present in a document for it to be retrieved. For example, ‘categorisation or categorization’ means that a document must contain either the term ‘categorisation’ or the term ‘categorization’.
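
The three Boolean operators described above can be sketched with set operations. This is a minimal illustration, not how any particular search engine works: the four sample documents and the function names are invented for the example, and matching is on whole lower-case words only.

```python
# Toy collection; the document texts are invented for illustration.
docs = {
    1: "automated categorisation of webpages",
    2: "share prices have gone down in the bear market",
    3: "the bear is a large mammal",
    4: "categorization by hand",
}

def containing(term):
    """Return the ids of documents that contain the term as a whole word."""
    return {doc_id for doc_id, text in docs.items() if term in text.split()}

def boolean_and(a, b):
    # Both terms must be present: set intersection.
    return containing(a) & containing(b)

def boolean_or(a, b):
    # At least one term must be present: set union.
    return containing(a) | containing(b)

def boolean_not(a, b):
    # The first term must be present and the second absent: set difference.
    return containing(a) - containing(b)

print(sorted(boolean_and("automated", "categorisation")))       # [1]
print(sorted(boolean_or("categorisation", "categorization")))   # [1, 4]
print(sorted(boolean_not("bear", "market")))                    # [3]
```

Document 2, the ‘bear market’ sentence, is excluded by ‘bear not market’, as in the glossary example.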

Bot (Agent, Robot): programs with some artificial intelligence that are sent to do a task in lieu of a real person. Spiders are one example. They run automatically and act autonomously.

Breadcrumb: link to all levels of the hierarchy above the current location, showing the route a searcher has taken, and the context of the current page. Breadcrumbs allow users to backtrack and to move up the hierarchy.
For example, Rhinitis>Allergic rhinitis>Perennial allergic rhinitis (Hayfever).

The term ‘breadcrumbs’ comes from the story of Hansel and Gretel, who dropped bits of bread to make a trail to help them find their way out of the forest. (Not that it helped them, as the birds ate the crumbs!)
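
A breadcrumb trail can be sketched as a list of hierarchy levels joined by a separator. The function names below are hypothetical, and a real site would render each level as a link rather than plain text.

```python
def breadcrumb(path, separator=">"):
    """Join the levels of the hierarchy into a single breadcrumb string."""
    return separator.join(path)

def ancestors(path):
    """Return the trail for every level above the current location."""
    return [path[:depth] for depth in range(1, len(path))]

trail = ["Rhinitis", "Allergic rhinitis", "Perennial allergic rhinitis"]

print(breadcrumb(trail))
# Each ancestor trail is a place the user could backtrack to.
for level in ancestors(trail):
    print(breadcrumb(level))
```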

Breadth: the number of navigation options available at each stage. A home page that provides links to 20 subsections has more breadth than one that says ‘Click here to select a department’.

Cascading style sheet, see Style sheet

Categorisation: the use of hierarchies based on words rather than notations. Each topic is allocated to a group, and that group is allocated to a more general group, and so on. Searching typically involves moving from more general to more specific topics; for example, to search for information on children’s birthday parties you might first select the option Celebrations, then Birthdays, then Children’s parties. Category structures are fairly arbitrary and may vary widely from one site to another; on a different site you might, for example, select Catering, then Parties, then Children’s parties, then Children’s birthday parties. By using techniques such as double posting and cross-referencing, categorised sites can provide for access from several different directions. See also automated categorisation.

Chunk: smallest unit of content that is used independently and needs to be indexed individually.

Classification: formal established classification schemes – for example, the Dewey Decimal Classification (DC) and Library of Congress Classification (LC) – that use a notation to describe classes of information.

Collaborative filtering: personalisation technology that uses recommendation engines to extract trends from the behaviour of website visitors and use that information to present suggestions to searchers. Amazon.com uses collaborative filtering to recommend books on the basis of purchases by other people with apparently similar interests.

Concordance: a (usually alphabetical) list of words from a book or website indicating the locations at which they occur. A concordance differs from an index in that no attempt to filter the source material or sensibly collate the information has been made.
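
As a rough sketch of the idea, a concordance can be built mechanically by listing every word with the locations where it occurs. The sample pages and the function name are invented for the example; note that nothing is filtered or collated, which is what distinguishes the result from an index.

```python
import re
from collections import defaultdict

def build_concordance(pages):
    """Map every word to the sorted list of locations at which it occurs."""
    concordance = defaultdict(list)
    for location, text in sorted(pages.items()):
        # Record each distinct word once per location.
        for word in set(re.findall(r"[a-z']+", text.lower())):
            concordance[word].append(location)
    # Alphabetical order, as in a typical printed concordance.
    return dict(sorted(concordance.items()))

# Hypothetical two-page source text.
pages = {1: "Squash is a game.", 2: "Squash is also a vegetable."}

for word, locations in build_concordance(pages).items():
    print(word, locations)
```

Even trivial words such as ‘a’ and ‘is’ appear in the output, and the two senses of ‘squash’ are not separated.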

Content management system (CMS): system for the creation, modification, archiving and removal of information resources from an organised repository. Includes tools for publishing, format management, revision control, indexing, search and retrieval.

Controlled vocabulary: a list of terms to be used in indexing (or cataloguing), often a thesaurus or synonym ring. Use of the same list by all indexers enhances consistency. Most libraries use the Library of Congress Subject Headings as a controlled vocabulary for cataloguing books and other library items.

Cost-per-click listing (CPC), see Pay-per-click listing (PPC)

Crawler, see Spider

Cross-reference: a See reference or See also reference leading the user from one part of the index to another.

CSS, see Style sheet

Database: a collection of records about individuals. Each record is made up of a number of fields relating to different characteristics of the individual. Many websites and web indexes are generated from databases.

Depth of hierarchy: the number of levels in the navigation hierarchy to the most specific topics. A site where you can select ‘amphibians’ then ‘frogs’ is shallower than one where you have to select ‘animals’, ‘vertebrates’, ‘amphibians’ and then ‘frogs’.

Depth of indexing: the number of entries and their specificity. A deep index will give direct access to all the topics that have been dealt with in the text. A shallow index will cover major and general topics, but will not index minor topics.

Dialog box: a box into which users of a computer application can enter information.

Directory: a collection of evaluated links to websites, usually categorised by subject. Many search engines, such as Yahoo and Google, have associated directories. When directories are limited to information on a specific subject or discipline they are often called subject gateways.

Distributed authoring: content creation by people distributed throughout an organisation, not by a centralised group of web specialists or writers. With distributed authoring there is often an expectation that subject metadata will also be created by authors. This is distributed indexing.

Document: Any item (not necessarily on paper) that can be indexed or catalogued.

DTD (Document Type Definition): schema specification method for XML documents. A DTD is a collection of XML markup declarations that define the structure, elements and attributes that can be used in a document that complies with that DTD. By consulting the DTD a parser can work with the tags from the markup language that document uses. DocBook is an example of a DTD often used with technical documentation to enable sharing and reuse.

Ebook/Electronic book: standalone document intended for on-screen reading on a PC or a handheld device, either a dedicated ‘reader’ or a general purpose Personal Digital Assistant (PDA).

Editorial results: search engine hits dependent on content and not influenced by payment.

Embedded indexing: indexing method in which tagged index entries are inserted into document files. Tags are used to bracket blocks of text and to show headings and subdivisions for index entries. Tagged index entries are not seen in the printed version, but can be compiled by software to make an index. If parts of the document are removed or rearranged the tagged index terms go with them. The index can then be recompiled to give an updated version. Embedded indexing is more time-consuming than normal indexing, but is efficient for documents that change often, or are not complete when indexing starts.
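
The compile step can be illustrated with a short sketch. The `{{index:...}}` tag syntax here is an assumption made up for the example (real tools each have their own tag format), as are the function names and sample text.

```python
import re
from collections import defaultdict

# Hypothetical tag format: {{index:heading}} embedded in the running text.
TAG = re.compile(r"\{\{index:([^}]+)\}\}")

def compile_index(pages):
    """Collect tagged entries and the pages on which they currently sit."""
    index = defaultdict(set)
    for page_number, text in pages.items():
        for heading in TAG.findall(text):
            index[heading.strip()].add(page_number)
    return {heading: sorted(nums) for heading, nums in sorted(index.items())}

def printed_text(text):
    """Strip the tags, which are not shown in the printed version."""
    return TAG.sub("", text)

pages = {
    14: "Misspellings{{index:metadata}} help.",
    15: "Dublin Core{{index:Dublin Core}}{{index:metadata}} is a standard.",
}
print(compile_index(pages))

# If the text moves to another page, its tags move with it;
# recompiling gives the updated locators.
pages[33] = pages.pop(15)
print(compile_index(pages))
```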

Facet: grouping of concepts of the same inherent type, for example, processes, disciplines, people, materials, places, and times.

Faceted metadata classification: breaking subjects into standard component parts (facets) and presenting these to users as search options. A topic such as wine might be divided into the facets such as country of origin, variety and price. In the best faceted search systems the user is provided with feedback about the number of terms retrieved at each stage.

False drop: document that is retrieved by a search but is not relevant to the searcher’s needs. False drops occur because of words that are written the same but have different meanings (for example, ‘squash’ can refer to a game, a vegetable or an action). 

Field searching: ability to limit a search by requiring that the search term is present in a specific ‘field’ (category of data) in the record. Field searching is often done with categories such as author and date that are common to most records.

Filing order: rules used for ordering (sorting) index entries. When a computer performs the sequencing it is often called sort order.
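
The difference between a computer's default sort order and a filing rule can be seen in a small sketch. The rule shown (ignore capitalisation) is only one simple example of a filing rule, chosen for illustration; the entries are taken from this glossary.

```python
entries = ["web indexing", "Website indexing", "XML", "anchor", "Bot"]

# Default sort compares character codes, so capitals file before lower case.
machine_order = sorted(entries)

# A simple filing rule: compare entries without regard to capitalisation.
filing_order = sorted(entries, key=str.casefold)

print(machine_order)  # ['Bot', 'Website indexing', 'XML', 'anchor', 'web indexing']
print(filing_order)   # ['anchor', 'Bot', 'web indexing', 'Website indexing', 'XML']
```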

Gateway, see Subject gateway

Global navigation: generally applicable navigational links (for example, Search; Site Map) available from all pages of a website.

Granularity: level of detail at which information is viewed or described. The more granular an access tool, the smaller the chunks of information it leads to. An index linking to specific paragraphs is more granular than a table of contents or site map linking to specific pages.

<HEAD> section: The <HEAD> section of an HTML document is placed at the top of the page between an opening tag, <HEAD>, and a closing tag, </HEAD>, and contains metadata about the document itself, not the content that will be displayed on the page. It is followed by the <BODY> section.

Hierarchy: a series of ordered groupings moving from broader general categories to narrow specific ones. In a web directory you may only see one level of the hierarchy at a time. When you select a topic you are then shown the options at the next level. See also Taxonomy; Thesaurus.

Hit highlighting: highlighting of the words in a results list which resulted in a document being retrieved by the search.
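
A minimal sketch of hit highlighting, assuming the search terms are matched as whole words regardless of case. The function name and the use of asterisks as the highlight marker are choices made for this example; a web page would normally use markup instead.

```python
import re

def highlight(text, terms, marker="**"):
    """Wrap every whole-word occurrence of a search term in the marker."""
    alternatives = "|".join(re.escape(term) for term in terms)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"{marker}{m.group(0)}{marker}", text)

result = highlight("Share prices have gone down in the bear market",
                   ["bear", "market"])
print(result)  # Share prices have gone down in the **bear** **market**
```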

HTML: hypertext markup language. The majority of webpages are made up of ordinary text ‘marked up’ with instructions in HTML which determine how the text is displayed by the user’s browser: for instance, the HTML code ‘<b>Huey</b>, <i>Dewey</i> and <u>Louie</u>’ appears in a browser with ‘Huey’ in bold, ‘Dewey’ in italics and ‘Louie’ underlined. HTML is also used to display graphics and define links to other sites and locations. See also XML.

Hypertext link, see Link

Indented style index: indented indexes start each subdivision on a new line, indented under the main heading. For example:

            names
                  indexing rules for  41-42
                  keyword searching and  5

Index entry: record in an index, consisting of a main heading and any associated locators, subheadings, and cross-references. This means the whole ‘metadata’ example below is one entry. When indexers charge by the entry they usually define each cross-reference or locator as an entry, meaning the ‘metadata’ sample below would contain six entries, made up of one cross-reference and five locators.

            metadata, see also thesauri
                  Dublin Core  15, 33-37
                  misspellings useful in  14
                  website structure derived from  99-101, 105

Indexing: often used to refer to the automatic selection and compilation of ‘meaningful’ words from a website into a list that can be used by a search system to retrieve pages. This list is more properly called a concordance. As this procedure involves no intellectual effort indexers distinguish their own work by calling it intellectual indexing, manual indexing, human indexing, or back-of-book-style indexing.

Information architecture: design of the structure of information systems, particularly websites and intranets, including labelling and navigation schemes.

Information foraging: seeking information according to its adaptive value. Information foraging theory analyses trade-offs in the value of information gained against the costs of searching based on the analogy of ‘foraging for wild food’.

Information scent: visual and linguistic cues that indicate to a searcher whether a website has the information they seek, and help the searcher navigate to the required information. Information scent is a component of information foraging.

Instantiation: the electronic or physical manifestation of a resource.

Internet: a global electronic communications system allowing public access to email, newsgroups, chat and the web.

Intranet: a local network with restricted access that uses some or all of the same systems and software as the Internet.

Keyword: a) In the search engine section keywords are words that are used to search for a topic. Also called ‘search terms’. b) In the metadata section, keywords are subject metadata terms.

Keyword searching: typing significant words and phrases that relate to a topic into a search engine. For example, to find information about your pet, Gerby, you might type the keywords gerbils and sand rats. If you wanted scientific information you could try the scientific terms Gerbillus, Tatera, Taterillus gracilis and so on. If you needed to find more general information you could broaden your search with the terms domestic animals and pets.

Legacy data: data stored on older computer systems or in older formats that remains behind as the legacy of outdated technologies. It can be difficult to integrate into newer systems.

Link: a block of text or a graphic appearing on a webpage, which a user can click with the mouse pointer to cause an event to happen. This usually involves being taken to another webpage or another part of the same page.

Live file: the copy of an electronic document that is currently being worked on, for example, by a writer or indexer. All changes must be made to the live file. If an indexer worked on one copy of a document, and an editor on another, the changes made by one of them would have to be incorporated into the document worked on by the other. (Live in a different context means that the file has been loaded onto the web and made available to users).

Local link, see Relative addressing

Local navigation: links that are specific to a section of a website, compared with global navigation which is available from all parts of a site.

Locator: the part of an index entry that tells the user where to look for information. In a book index locators are usually page numbers (but can also be references to items, paragraphs and so on). In a web index they are direct links to the information. The links can be the heading or subdivisions of the index entry.

Main heading: heading at the beginning of an index entry, either used alone or modified by subheadings. The main heading is an entry point into the index. (Cross-references are the other entry points).

Markup language: a way of depicting the logical structure or semantics of a document and providing instructions to computers on how to handle or display the contents of the file. HTML, XML and RDF are markup languages. Markup indicators are often called tags.

Metadata: structured data about data, which may include information about the author, title and subject of web resources. Metadata is added in the <HEAD> section of the webpage or is stored in a database. It is available for searching but is not displayed on the page.
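
As an illustration of metadata that is present in a page but not displayed, the sketch below reads `<meta>` elements with Python's standard `html.parser` module. The sample page, its metadata values and the class name `MetaCollector` are invented for the example.

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            if "name" in attributes and "content" in attributes:
                self.metadata[attributes["name"]] = attributes["content"]

# Hypothetical page: the metadata sits in the head, the visible text in the body.
page = """<html><head><title>Glossary</title>
<meta name="DC.creator" content="Glenda Browne">
<meta name="keywords" content="indexing, glossary">
</head><body><p>Visible text.</p></body></html>"""

collector = MetaCollector()
collector.feed(page)
print(collector.metadata)
```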

Multi-purposing, see Single sourcing

Namespace: a closed set of names or a place where a schema (set of names) is stored. Namespaces are identified via a URI (for example, a URL) and are a mechanism to resolve naming conflicts. Within a given namespace all names must be unique, although the same name may be used with a different meaning in a different namespace.

Notation: code used in formal classification schemes. In the Dewey Decimal Classification the notation 993 refers to the history of New Zealand, and the notation 994 refers to the history of Australia.

Ontology: specification of a conceptualisation of a knowledge domain. An ontology is a controlled vocabulary that describes objects and the relations between them in a formal way, and has a grammar for using the vocabulary terms to express something meaningful within a specified domain of interest. The vocabulary is used to make queries and assertions. Ontological commitments are agreements to use the vocabulary in a consistent way for knowledge sharing. Ontologies can include glossaries, taxonomies and thesauri, but normally have greater expressivity and stricter rules than these tools. A formal ontology is a controlled vocabulary expressed in an ontology representation language.

Page number, see Locator

Pageless index: electronic index in which index entries link directly to the text they refer to rather than listing page numbers for the user to find.

Paid inclusion: payment for inclusion of a site in a search engine’s editorial listings, without an artificial boost in ranking.

Paid placement (Advertising): Listing in search engine results where advertisers pay for a guaranteed high ranking, usually dependent on specified keywords being used in a search. These listings are usually segregated from editorial results and labelled to indicate that they are ads. Also known as ‘pay for placement’, ‘pay for performance’, or pay-per-click listings (PPC). The last two terms refer to the usual method of payment, which is based on the number of times the link is selected (‘clicked’) by a user. See also sponsorship.

Paid submission: payment for guaranteed consideration of a site for inclusion in a directory.

Pay-per-click listing (PPC): search engine advertising in which payment is based on the number of times the website is selected (clicked) from the results list. It is used in paid placement advertising and paid inclusion. Also known as cost-per-click listings (CPC).

PDF (portable document format) file: a way of displaying documents in the form in which they will be printed. Used when for legal or other reasons an exact copy of a printed document must be made available electronically. PDF files are displayed and printed using Adobe Acrobat Reader software.

Pick list: A list from which a computer user can select terms. Usually found in a menu, form or dialog box.

Precision: the relevance to the searcher of the items that are retrieved. If a search retrieves one hundred documents of which ninety-five are very relevant, that search has high precision.

RDF: a formal data model from the W3C using XML for the description of web resources using machine readable metadata. It has potential for use in the semantic web.

RDF schema (RDFS): defines a set of metadata properties (for example, ‘Creator’) that can be associated with resources.

Recall: the proportion of relevant information that is retrieved by a search. If a search only retrieves one hundred relevant documents out of three thousand that are available, that search has low recall. If it retrieves all the available documents on the topic, it has high recall.
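
Precision and recall are commonly computed as simple ratios, which the following sketch applies to the figures used in the glossary examples. The document ids are arbitrary, and the formulas are the usual information-retrieval definitions rather than anything specific to this book.

```python
def precision(retrieved, relevant):
    """Share of the retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Share of the relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)

# 3000 relevant documents exist; the search returns 100 documents,
# of which 95 are relevant and 5 are not.
relevant = set(range(3000))
retrieved = set(range(95)) | set(range(5000, 5005))

print(f"precision = {precision(retrieved, relevant):.2f}")  # 0.95
print(f"recall    = {recall(retrieved, relevant):.3f}")     # 0.032
```

The same search can therefore have high precision and low recall at once.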

Recommendation engine, see Collaborative filtering

Relative addressing (Local links): linking to another page on the same website through a local address rather than a complete URL: for instance, a link back to the homepage might take a form such as <a href="index.html">Home page</a>.
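
The relationship between relative and absolute addresses can be sketched with Python's standard `urllib.parse` module, which resolves a relative address against the address of the page that contains the link. The page path `/resources/index.html` and the file names are hypothetical.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical address of the page that contains the links.
base = "http://www.aussi.org/resources/index.html"

def is_absolute(href):
    """An absolute address carries its own scheme (for example, http)."""
    return bool(urlparse(href).scheme)

for href in ["glossary.html", "../index.html", "http://example.com/"]:
    kind = "absolute" if is_absolute(href) else "relative"
    print(f"{href!r} is {kind}; it resolves to {urljoin(base, href)}")
```

A relative link keeps working if the whole site moves to a new domain, whereas an absolute link to the old domain would not.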

Repurposing, see Single sourcing

Robot, see Bot

Run-on (run-in) style index: run-on indexes list all subdivisions in sequence, separated by punctuation marks such as semicolons. For example:

names: indexing rules for  41-42; keyword searching and  5
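
The same entry can be laid out in either style from one set of data, as this sketch shows. The function names and the four-space indent are choices made for the example; the entry itself is the ‘names’ sample used in this glossary.

```python
def indented(heading, subdivisions, indent="    "):
    """Indented style: each subdivision on its own line under the heading."""
    lines = [heading]
    for text, locators in subdivisions:
        lines.append(f"{indent}{text}  {locators}")
    return "\n".join(lines)

def run_on(heading, subdivisions):
    """Run-on style: subdivisions in sequence, separated by semicolons."""
    parts = [f"{text}  {locators}" for text, locators in subdivisions]
    return f"{heading}: " + "; ".join(parts)

subdivisions = [("indexing rules for", "41-42"), ("keyword searching and", "5")]

print(indented("names", subdivisions))
print(run_on("names", subdivisions))
```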

Schema: a description of the structure and rules that a document of a given XML document type must satisfy. Includes the formal declaration of the elements that make up a document.

Search engine: server that ‘indexes’ webpages, stores the results, and uses them to return lists of pages which match users’ search queries. See also Directory; Indexing.

Search log: record of searches performed.

Search term, see Keyword

See also reference: directs index users to related topics that could be consulted in addition to the topic they are currently at: for example, ‘beds, 26, see also cots’.

See reference: a way of indicating to a user that they should look elsewhere. A see reference may point to two or more locations: for example, ‘rodents, see mice; rats’. The choice of which terms to use and which to refer from depends on the language of the material being indexed and the target audience.

Semantic web: project of the W3C in which automated methods based on quality metadata are envisaged to replace much human searching of the web. Relies on ontologies, XML and RDF.

Semantics: meaning. If a computer understands the semantics of a document, it understands the meaning, rather than just interpreting a series of characters.

Single sourcing: using one content repository to generate documents in different formats. The content only needs to be written and maintained in one place, but can be output in formats such as HTML and RTF (rich text format) as required. Also known as multi-purposing. Repurposing refers to the sequential output of content in different formats using different software tools.

Site indexing, see Website indexing

Site map: Overview of the navigational structure of a website, acting like a Table of Contents, and used to orient users and show them the scope of the site. Site maps can be textual or visual. Usually each location is an active link, enabling a user to move directly to that section. Site maps can also be important sources of links for search engine spiders to follow.

Sort order, see Filing order

Specificity: narrowness of terms. ‘Maternity leave’ is more specific than ‘parental leave’, which is itself more specific than ‘leave’. Book indexers normally aim to use a term with the same specificity as the information being indexed, although users often search with broader terms.

Spider (Crawler, Web crawler): bot that visits publicly accessible websites following all links it comes across collecting data for search engine ‘indexes’. A spider discovers new sites and updates information from sites previously visited. A spider can also be used to check links within a website.

Sponsorship: Sponsored ads are normally located in a separate boxed and labelled section at the top of a search engine’s results list. See also paid placement.

Stemming: expansion of searches to include plural forms and other word variations.
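
A deliberately crude sketch of stemming: each word is reduced to a stem by stripping a few common English suffixes, and a search term matches any word with the same stem. The suffix list and function names are assumptions for illustration; production stemmers use far more careful rules.

```python
# Illustrative suffix list only; longer suffixes are tried first.
SUFFIXES = ("ing", "es", "ed", "s")

def stem(word):
    """Strip one common suffix, keeping at least three letters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def matches(query, text):
    """Return the words in the text that share a stem with the query."""
    target = stem(query)
    return [word for word in text.split() if stem(word) == target]

print(matches("index", "indexes indexing indexed indexer index"))
# ['indexes', 'indexing', 'indexed', 'index']
```

‘indexer’ is missed because ‘er’ is not in the suffix list, which shows how much the results depend on the rules chosen.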

Style sheet: a block of text in which one or more formats for webpage display are defined. This may include redefinitions of standard formats or new formats specific to that page or site. Style sheets may be embedded in a particular webpage or stored as a separate text file to which some or all of the webpages on a site are linked. Where several style sheets are linked to one page, the order in which they are named determines which ones take precedence in the case of conflicting definitions. These are called cascading style sheets (CSS).

Subheading/subentry/subdivision: headings that follow a main heading to modify it. In the index sample below, metadata is the main heading, and ‘Dublin Core’ and ‘misspellings useful in’ are subheadings.

            metadata, see also thesauri
                  Dublin Core  15, 33-37
                  misspellings useful in  14
                  website structure derived from  99-101, 105

Subject gateway: a directory limited to a specific subject area such as education, or Tasmania. Sometimes called a ‘portal’.

Subsite: a distinct section of a website that might warrant its own navigational systems.

Supplementary navigation: information access methods separate from the basic site structure or browse navigation. See also Back-of-book-style indexing; Site map.

Synonym ring/list: sets of synonyms. If someone searches using one synonym from a set (ring), the other words or phrases in the set are also included in the search.
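
Query expansion with a synonym ring can be sketched as a lookup that replaces a search term with its whole set. The two rings below reuse term pairs mentioned elsewhere in this glossary purely as sample data, and the function name is hypothetical.

```python
# Each ring is a set of terms treated as equivalent for searching.
RINGS = [
    {"gerbils", "sand rats"},
    {"cars", "automobiles"},
]

def expand(term):
    """Return every term in the same ring, or the term alone if it has none."""
    for ring in RINGS:
        if term in ring:
            return set(ring)
    return {term}

print(sorted(expand("cars")))     # ['automobiles', 'cars']
print(sorted(expand("spiders")))  # ['spiders']
```

Unlike a thesaurus, a ring has no preferred term: searching on any member retrieves material for all of them.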

Table of contents, see Site map

Tag: a piece of text that describes the semantics or structure of a unit of data (element) in HTML, XML or other markup language. Tags are surrounded by angle brackets (< >) to distinguish them from text. ‘Tags’ is also used to describe the code indicating index entries in embedded indexing.

Target, see Anchor; URL

Taxonomy: controlled vocabulary used primarily for the creation of navigation structures for websites. Often based on a thesaurus, but may have shallower hierarchies and less structure, for example, no related terms.

Thesaurus: a structured list of approved subject headings (preferred terms) showing the relationships between them. The relationships include broader (parent) terms, narrower (child) terms, and related terms. The thesaurus also shows terms that are not to be used in indexing (nonpreferred terms) with references to the terms that should be used instead (for example, ‘automobiles, see cars’). See also taxonomy. (According to the NISO standard on thesaurus construction, the plural of ‘thesaurus’ can be ‘thesauri’ or ‘thesauruses’. We used ‘thesauruses’ in the first edition of this book, but have yielded to reviewer preferences and the interests of brevity.)

Three-click rule: the three-click rule suggests that if a user has to click more than three times to find the information they are looking for they will give up the search.

Topic map: tool for representation of model-based data on the web for enhanced access. Topic maps are based on topics, associations and occurrences. In comparison with RDF, topic maps are developed separately from the documents they refer to.

Unlimited aliasing, see Synonym ring/list

URI (Uniform Resource Identifier): unique identifier of the location of a resource. In many cases the URI will be a URL (that is, a website address, for example: http://www.aussi.org).

URL: Uniform Resource Locator – the address of a webpage. For example, http://www.aussi.org (also written as www.aussi.org).

Usability: efficiency with which a user can perform required tasks with a product, for example, a website. Usability can be measured objectively via performance errors and productivity, and subjectively via user preferences and interface characteristics. Web design features that affect usability include navigation design and content layout.

User: also known as visitor, participant, actor, searcher, employee, customer, and client.

Visualisation: graphical presentation of information, often dependent on categorisation or clustering techniques to bring out patterns in the information.

W3C/World Wide Web Consortium: an international consortium of companies which develops specifications, guidelines, software and tools for the web using open standards to ensure interoperability. It is the chief standards body for HTTP and HTML. The W3C was founded in 1994 by Tim Berners-Lee, the original creator of the web.

Web: a vast collection of files accessible to the public through the Internet, viewed through a browser, and connected by hypertext links. Also known as the World Wide Web, or W3.

Web crawler, see Spider

Web indexing: a) search engine indexing of the world wide web; b) creation of metadata; c) organisation of web links by category; d) creation of website indexes.

Website indexing: the creation of a back-of-book-style index to an individual website (or subsite, web document or intranet).

XML: a relative of HTML that specifies not only the appearance, but also the type of material on display. For example, names can be given XML tags such as <firstname> and <surname>, making for more flexible searching and display. RDF is written using XML.

Last Updated on Friday, 17 April 2009 00:22