Wednesday, November 26, 2003

Hilltop: A Search Engine based on Expert Documents

"propose a novel ranking scheme for broad queries that places the most authoritative pages on the query topic at the top of the ranking. Our algorithm operates on a special index of "expert documents." These are a subset of the pages on the WWW identified as directories of links to non-affiliated sources on specific topics. Results are ranked based on the match between the query and relevant descriptive text for hyperlinks on expert pages pointing to a given result page.


Three approaches to improve the authoritativeness of ranked results have been taken in the past:

1) Ranking Based on Human Classification: Human editors have been used by companies such as Yahoo! and Mining Company to manually associate a set of categories and keywords with a subset of documents on the web. These are then matched against the user's query to return valid matches. The trouble with this approach is that: (a) it is slow and can only be applied to a small number of pages, and (b) often the keywords and classifications assigned by the human judges are inadequate or incomplete. Given the rate at which the WWW is growing and the wide variation in queries this is not a comprehensive solution.

2) Ranking Based on Usage Information: Some services such as DirectHit collect information on: (a) the queries individual users submit to search services and (b) the pages they look at subsequently and the time spent on each page. This information is used to return pages that most users visit after deploying the given query. For this technique to succeed a large amount of data needs to be collected for each query. Thus, the potential set of queries on which this technique applies is small. Also, this technique is open to spamming.

3) Ranking Based on Connectivity: This approach involves analyzing the hyperlinks between pages on the web on the assumption that: (a) pages on the topic link to each other, and (b) authoritative pages tend to point to other authoritative pages.
PageRank relies on (b).

Our approach is based on the same assumptions as the other connectivity algorithms, namely that the number and quality of the sources referring to a page are a good measure of the page's quality. The key difference consists in the fact that we are only considering "expert" sources - pages that have been created with the specific purpose of directing people towards resources. In response to a query, we first compute a list of the most relevant experts on the query topic. Then, we identify relevant links within the selected set of experts, and follow them to identify target web pages. The targets are then ranked according to the number and relevance of non-affiliated experts that point to them. Thus, the score of a target page reflects the collective opinion of the best independent experts on the query topic. When such a pool of experts is not available, Hilltop provides no results. Thus, Hilltop is tuned for result accuracy and not query coverage.

Our algorithm consists of two broad phases:

(i) Expert Lookup


We define an expert page as a page that is about a certain topic and has links to many non-affiliated pages on that topic. Two pages are non-affiliated conceptually if they are authored by authors from non-affiliated organizations.

(ii) Target Ranking

We believe a page is an authority on the query topic if and only if some of the best experts on the query topic point to it.

The problem is, how can we distinguish an expert from other types of pages? In other words what makes a page an expert? We felt that an expert page needs to be objective and diverse: that is, its recommendations should be unbiased and point to numerous non-affiliated pages on the subject. Therefore, in order to find the experts, we needed to detect when two sites belong to the same or related organizations.

2.1 Detecting Host Affiliation

We define two hosts as affiliated if one or both of the following is true:
They share the same first 3 octets of the IP address.
The rightmost non-generic token in the hostname is the same.
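
The two affiliation tests above could be sketched roughly as follows. This is only an illustration: the `GENERIC_TOKENS` set is a small hand-picked assumption (the excerpt does not list which tokens count as generic), the function names are made up for this sketch, and the hostnames and IP addresses are invented examples.

```python
from typing import Optional, Tuple

# Assumption: a tiny, hand-picked set of "generic" hostname tokens.
# A real system would need a far more complete list of suffixes.
GENERIC_TOKENS = {"com", "org", "net", "edu", "gov", "co", "uk", "mx"}


def ip_prefix(ip: str) -> Tuple[str, ...]:
    """Return the first three octets of a dotted-quad IPv4 address."""
    return tuple(ip.split(".")[:3])


def rightmost_non_generic_token(hostname: str) -> Optional[str]:
    """Scan the hostname right to left and return the first non-generic token."""
    for token in reversed(hostname.lower().split(".")):
        if token and token not in GENERIC_TOKENS:
            return token
    return None


def affiliated(host_a: str, ip_a: str, host_b: str, ip_b: str) -> bool:
    """Two hosts are affiliated if one or both of the tests hold."""
    same_subnet = ip_prefix(ip_a) == ip_prefix(ip_b)
    token_a = rightmost_non_generic_token(host_a)
    token_b = rightmost_non_generic_token(host_b)
    same_token = token_a is not None and token_a == token_b
    return same_subnet or same_token


# Same rightmost non-generic token ("ibm"), different networks.
print(affiliated("www.ibm.com", "192.0.2.10", "ibm.co.mx", "198.51.100.7"))
# Different tokens, but the first three octets match.
print(affiliated("a.example.org", "203.0.113.5", "b.other.net", "203.0.113.77"))
```

Under these assumptions both calls report the hosts as affiliated, the first by hostname token and the second by IP prefix.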

Keywords - Indexing the experts
...document text. URLs located within the scope of a phrase are said to be "qualified" by it. For example, the title, headings (e.g., text within a pair of <H1> </H1> tags) and anchor text within the expert page are considered key phrases. The title has a scope that qualifies all URLs in the document. A heading's scope qualifies all URLs until the next heading of the same or greater importance...
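
As a rough illustration of how key-phrase scopes might be tracked, here is a small sketch. The flat page model (tuples tagged "title", "heading", "link") and the function name `qualifying_phrases` are hypothetical, and the sketch assumes that anchor text qualifies only its own URL, which the excerpt above does not state.

```python
from typing import Dict, List, Tuple

# Hypothetical page model: a flat list of elements in document order.
#   ("title", text)
#   ("heading", level, text)    level 1 is the most important, like <H1>
#   ("link", url, anchor_text)
Element = Tuple


def qualifying_phrases(elements: List[Element]) -> Dict[str, List[str]]:
    """Map each URL to the key phrases whose scope covers it."""
    # The title qualifies every URL in the document.
    titles = [e[1] for e in elements if e[0] == "title"]
    active_headings: List[Tuple[int, str]] = []
    qualified: Dict[str, List[str]] = {}
    for element in elements:
        kind = element[0]
        if kind == "heading":
            _, level, text = element
            # A heading's scope ends at the next heading of the same or
            # greater importance, i.e. the same or a smaller level number.
            active_headings = [(lv, t) for lv, t in active_headings if lv < level]
            active_headings.append((level, text))
        elif kind == "link":
            _, url, anchor = element
            # Assumption: anchor text qualifies only the URL it belongs to.
            phrases = titles + [t for _, t in active_headings] + [anchor]
            qualified.setdefault(url, []).extend(phrases)
    return qualified


page = [
    ("title", "Guide to Telescopes"),
    ("heading", 1, "Retailers"),
    ("link", "http://a.example/", "Alpha Optics"),
    ("heading", 2, "Used gear"),
    ("link", "http://b.example/", "Beta Scopes"),
    ("heading", 1, "Clubs"),
    ("link", "http://c.example/", "Gamma Club"),
]
for url, phrases in qualifying_phrases(page).items():
    print(url, phrases)
```

In this invented page, the second link stays inside the scope of both "Retailers" and "Used gear", while the third link is qualified only by the title, "Clubs", and its own anchor text.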

For a target to be considered it must be pointed to by at least 2 experts on hosts that are mutually non-affiliated and are not affiliated to the target. For all targets that qualify we compute a target score reflecting both the number and relevance of the experts pointing to it and the relevance of the phrases qualifying the links.
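
The qualification rule can be sketched as a simple filter. The names `qualifies` and `same_domain` are invented for this example, and `same_domain` is only a toy stand-in for a real affiliation test. The excerpt does not give the scoring formula, so none is attempted here.

```python
from itertools import combinations
from typing import Callable, List


def same_domain(host_a: str, host_b: str) -> bool:
    """Toy affiliation test (an assumption): compare the last two labels."""
    return host_a.lower().split(".")[-2:] == host_b.lower().split(".")[-2:]


def qualifies(
    target_host: str,
    expert_hosts: List[str],
    affiliated: Callable[[str, str], bool],
) -> bool:
    """True if at least 2 experts are mutually non-affiliated and
    neither is affiliated with the target."""
    # Discard experts that are affiliated with the target itself.
    independent = [h for h in expert_hosts if not affiliated(h, target_host)]
    # Look for any pair of remaining experts that are not affiliated.
    return any(not affiliated(a, b) for a, b in combinations(independent, 2))


print(qualifies("www.target.com", ["links.alpha.org", "dir.beta.net"], same_domain))
print(qualifies("www.target.com", ["a.alpha.org", "b.alpha.org"], same_domain))
print(qualifies("www.target.com", ["links.target.com", "dir.beta.net"], same_domain))
```

Only the first call passes: the second has two experts on affiliated hosts, and the third has one expert affiliated with the target, leaving a single independent expert.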

Conclusions
We described a new ranking algorithm for broad queries called Hilltop and the implementation of a search engine based on it. Given a broad query Hilltop generates a list of target pages which are likely to be very authoritative pages on the topic of the query. This is by virtue of the fact that they are highly valued by pages on the WWW which address the topic of the query. In computing the usefulness of a target page from the hyperlinks pointing to it, we only consider links originating from pages that seem to be experts. Experts in our definition are directories of links pointing to many non-affiliated sites. This is an indication that these pages were created for the purpose of directing users to resources, and hence we regard their opinion as valuable. Additionally, in computing the level of relevance, we require a match between the query and the text on the expert page which qualifies the hyperlink being considered. This ensures that hyperlinks being considered are on the query topic. For further accuracy, we require that at least 2 non-affiliated experts point to the returned page with relevant qualifying text describing their linkage. The result of the steps described above is to generate a listing of pages that are highly relevant to the user's query and of high quality.
