ANNOTATING WEB SEARCH RESULTS
ABSTRACT
With many millions of pages, the Web has grown into a massive information resource. This data is presented in the form of documents, photographs, videos, and text.
Obtaining the desired information from such massive amounts of data is a constant challenge. Users frequently rely on search engines to find the content they are looking for on the Internet. Searching can be done manually using available platforms such as Google or automatically using site crawlers.
Because the Web is not semantically structured, search results may contain many different kinds of information pertaining to the same query. Such results often cannot be analysed immediately to suit a specific interpretation need.
The search result records (SRRs) received from the Web as a result of human or automatic searches are web pages containing results from underlying databases.
Such results can then be used in a variety of applications, such as data collection, price comparison, and so on. For this to be possible, the SRRs must be machine processable.
To accomplish this, it is critical that the SRRs be annotated meaningfully. Annotation enhances the usefulness of SRRs by allowing the collected data to be saved for subsequent analysis and making the collection easier to read and understand.
Annotation also prepares the data for visualisation. SRRs with similar topics are grouped together, making it easy to compare, analyse, and browse the collection.
The goal of this study is to discover how Web search results can be automatically annotated and restructured to enable data visualisation for users in a certain domain of discourse.
A case study application is developed that employs a web crawler to retrieve web pages on any topic related to public health. Mr. Emmanuel Onu’s work in the project “Proposal of a Tool to Enhance Competitive Intelligence on the Web” is the basis for this research.
CHAPTER ONE:
INTRODUCTION
People from all walks of life use the Internet for a variety of purposes, including buying and selling products, social networking, digital libraries, news, and so on.
Researchers require information from digital libraries and other online document repositories in order to conduct research and share their findings; scholars need books to obtain information and knowledge; people communicate with one another by email over the Web;
others use social media to exchange information and for casual chat; and some conduct transactions, such as purchasing items and paying bills, over the Web.
The World Wide Web is now the primary storehouse for “all kinds of information” and has been very successful in transmitting information to people.
Many database applications, such as e-commerce and digital libraries, have made the Web their primary medium. These applications store data in massive databases that users can access, query, and update through the Web.
The advancement of hardware technology has resulted in a rise in the storage capacity of computers and servers. As a result, many web servers hold large amounts of data on their storage systems. Users can upload photographs, videos, and other documents to social media networks such as Facebook [1].
YouTube [2] users can upload videos of varying lengths to its servers. Other automated systems acquire large amounts of data on a daily basis. Banking systems, for example, must retain daily ATM transactions as well as other customer transactions.
Some monitoring systems collect data about a particular aspect of life, such as climate change, while others, such as online shopping systems, keep track of customers’ everyday purchasing activity.
These are just a few of the developments that have resulted in a massive amount of information and documents being available on the Web.
However, access to this vast collection of knowledge has been limited to browsing and searching due to the heterogeneity and lack of organisation of Web information sources.
To view a document, one must enter its URL (Uniform Resource Locator) into a Web browser or use a search engine. The first method is appropriate when you know exactly what you are looking for and where to find it on the Internet.
However, this is rarely the case, and as a result, many Web users rely on search engines to find specific content. Some systems require the user to manually enter a search phrase and retrieve documents based on the terms entered, while other automated search systems use a Web Crawler.
There are several well-known web-based search engines that index web documents and are accessible to Web users. Google, Yahoo, AltaVista, and many others are the most popular.
Such systems search through a collection of documents sourced from both the Surface Web (which is indexed by regular search engines) and the Deep Web (which requires the use of specialised tools to reach).
Most people benefit from such systems when looking for unknown information or when tracing a website they know but cannot recall the URL for.
Even so, some business disciplines, such as Competitive Intelligence [3], require a specific sort of information (domain specific) in order to make strategic business decisions. In such cases, several tools are created to aid in information gathering and processing.
Several alternative approaches for searching and retrieving information to acquire intelligence are also effective in such domain-specific settings. Manually browsing the Internet, for example, may be the simplest technique for carrying out a Competitive Intelligence task.
A fair level of manual Internet browsing ensures the quality of the documents acquired, which in turn improves the quality of the discoverable knowledge [4].
However, the difficulty here is that a significant amount of time is invested. According to Onu, a survey of over 300 Competitive Intelligence specialists reveals that data gathering is the most time-consuming task in typical Competitive Intelligence projects, accounting for more than 30% of overall project time.
In this instance, it is mentally taxing and stressful for Competitive Intelligence specialists to manually search the Internet to read the content on every page of a Website in order to identify important information and also to synthesise the information.
There is certainly a high need for data collection from multiple Websites around the Internet. For instance, in an online shopping system that gathers multiple result records from several item sites, it is necessary to verify whether any two items retrieved in the search result records refer to the same item.
To accomplish this in an online book-shopping system, the ISBNs can be compared. If ISBNs are not available, titles and authors may be used instead.
Such a system is also intended to list the price of an item from each site. As a result, the system must understand the semantics of each data unit.
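As a minimal sketch of this record-linkage step (the field names, sample records, and matching rules below are illustrative assumptions, not taken from the project), two SRRs could be matched first on ISBN and, failing that, on a normalised title-plus-author comparison:

# Minimal record-linkage sketch: decide whether two search result records
# (SRRs) refer to the same book. Field names here are illustrative assumptions.

def normalise(text):
    """Lower-case and strip punctuation/whitespace for a rough comparison."""
    return "".join(ch for ch in text.lower() if ch.isalnum())

def same_book(srr_a, srr_b):
    """Match on ISBN when both records have one; otherwise fall back to
    a normalised title + author comparison."""
    isbn_a, isbn_b = srr_a.get("isbn"), srr_b.get("isbn")
    if isbn_a and isbn_b:
        return normalise(isbn_a) == normalise(isbn_b)
    return (normalise(srr_a["title"]) == normalise(srr_b["title"])
            and normalise(srr_a["author"]) == normalise(srr_b["author"]))

# Example: two records scraped from different shopping sites.
record_1 = {"title": "Data Mining Concepts", "author": "J. Han", "isbn": "978-0123814791", "price": "45.00"}
record_2 = {"title": "Data Mining  Concepts", "author": "J. Han", "isbn": None, "price": "42.50"}

print(same_book(record_1, record_2))  # True: falls back to title + author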
Unfortunately, the semantic labels of data units are often absent from the result pages. Figure X, for example, contains no semantic labels for the values title, author, publisher, and so on.
Semantic labels for data units are crucial not only for the record-linkage task mentioned above, but also for placing collected search result records into a database table (e.g., for Deep Web crawlers) for further analysis. Earlier applications required enormous human effort to manually label data units, which greatly limits their scalability.
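To make the idea concrete, the sketch below shows how the raw data units of one SRR could be annotated with semantic labels and turned into a table-ready row. The label names and the trivial rule-based assigner are assumptions for illustration only, not the annotation method proposed in this project:

import re

# Minimal annotation sketch: assign semantic labels to the raw data units of
# one search result record (SRR). The rules are illustrative assumptions.

def annotate(data_units):
    """Return {label: value} for a list of raw strings from one SRR."""
    annotated = {}
    for unit in data_units:
        if re.fullmatch(r"(97[89][- ]?)?\d{1,5}[- ]?\d+[- ]?\d+[- ]?[\dX]", unit):
            annotated["isbn"] = unit
        elif re.fullmatch(r"[$€£]?\d+(\.\d{2})?", unit):
            annotated["price"] = unit
        elif unit.lower().startswith("by "):
            annotated["author"] = unit[3:]
        else:
            # Everything else is treated as the title in this toy example.
            annotated.setdefault("title", unit)
    return annotated

# Raw, unlabelled data units as they might be extracted from a result page.
srr = ["Introduction to Algorithms", "by T. Cormen", "978-0262033848", "$89.99"]
print(annotate(srr))
# {'title': 'Introduction to Algorithms', 'author': 'T. Cormen',
#  'isbn': '978-0262033848', 'price': '$89.99'}

Once every data unit carries a label such as title, author, or price, the records can be stored in a single database table and compared or visualised across sites.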
Various tools have been developed to assist in the search, collection, analysis, categorisation, and visualisation of huge collections of web content. Mr. Emmanuel Onu’s paper “Proposal of a Tool to Enhance Competitive Intelligence on the Web” proposes one such tool, known as the CI Web Snooper.
This study is built around that tool and is a continuation of the work begun in the preceding publication.
The CI Web Snooper is an Internet search and retrieval tool that can be used for information collecting and knowledge extraction.
It employs a real-time search technique to ensure that the information it obtains from the Web is up to date. It consists of four primary components:
the User Interface, the Thesaurus Model, the Web Crawler, and the Indexer. The User Interface enables the user to enter a search query as well as seed URLs for the Web Crawler to use in its search.
The Thesaurus Model is used to model the domain of interest and is essential for query reformulation and Web page indexing.
The Web Crawler component is in charge of discovering and downloading Web pages using the Breadth-First search technique, which begins with the URLs supplied by the user. Figure 1 depicts the structure of the CI Web Snooper, whereas Figure 2 depicts the collected results.
Figure 1: The CI Web Snooper’s Structure
Figure 2: Web pages downloaded
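As an illustration of the Breadth-First crawling strategy described above, the following is a minimal sketch. The seed URL, the page limit, and the use of Python’s standard library are assumptions for illustration, not the CI Web Snooper’s actual implementation:

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

# Minimal breadth-first crawler sketch. The seed URL, page limit, and use of
# the standard library are illustrative assumptions only.

class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def bfs_crawl(seed_urls, max_pages=20):
    """Visit pages level by level, starting from the user-supplied seeds."""
    queue = deque(seed_urls)
    visited = set()
    pages = {}  # url -> raw HTML, kept for later indexing and annotation
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue  # skip unreachable or non-text pages
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            queue.append(urljoin(url, link))  # enqueue neighbours (next level)
    return pages

# Example: crawl starting from a hypothetical public-health seed URL.
# results = bfs_crawl(["https://example.org/public-health"])

Because the queue is first-in, first-out, all pages reachable from the seeds in one link are downloaded before any page two links away, which is the defining property of the Breadth-First strategy.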
The primary goal of this project is to automatically annotate a collection of websites acquired from the Internet for use in a domain-specific activity.
Thus, a framework for automatically annotating and reorganising Web pages obtained from a search query will be presented. The remainder of this paper is organised as follows: Section 2 covers related work and discussion.
Section 3 provides a full discussion of the approach used in constructing the restructuring and annotation platform. Section 4 discusses the platform’s deployment, behaviour, and results. Section 5 presents the conclusion.