TY - CHAP
T1 - Automating information discovery within the invisible web
AU - Sweeney, Edwina
AU - Curran, Kevin
AU - Xie, Ermai
N1 - Publisher Copyright: © Springer-Verlag London Limited 2010.
PY - 2015
Y1 - 2015
N2 - A Web crawler, or spider, traverses the Web looking for pages to index; when it locates a new page, it passes the page to an indexer. The indexer identifies links, keywords, and other content and stores them in its database. Users search this database by entering keywords through an interface, and matching Web pages are returned on a results page as hyperlinks accompanied by short descriptions. The Web, however, is increasingly moving away from being a collection of documents toward a multidimensional repository of images, audio, and other formats. As a result, certain parts of the Web are invisible or hidden. The term “Deep Web” has emerged to refer to the mass of information that can be accessed via the Web but cannot be indexed by conventional search engines. The Deep Web makes searching considerably more complex for search engines. Google states that the claim that conventional search engines cannot find documents such as PDF, Word, PowerPoint, and Excel files, or any other non-HTML page, is not fully accurate, and that steps have been taken to address this problem by implementing procedures to search items such as academic publications, news, blogs, videos, books, and real-time information. However, Google still provides access to only a fraction of the Deep Web. This chapter explores the Deep Web and the tools currently available for accessing it.
AB - A Web crawler, or spider, traverses the Web looking for pages to index; when it locates a new page, it passes the page to an indexer. The indexer identifies links, keywords, and other content and stores them in its database. Users search this database by entering keywords through an interface, and matching Web pages are returned on a results page as hyperlinks accompanied by short descriptions. The Web, however, is increasingly moving away from being a collection of documents toward a multidimensional repository of images, audio, and other formats. As a result, certain parts of the Web are invisible or hidden. The term “Deep Web” has emerged to refer to the mass of information that can be accessed via the Web but cannot be indexed by conventional search engines. The Deep Web makes searching considerably more complex for search engines. Google states that the claim that conventional search engines cannot find documents such as PDF, Word, PowerPoint, and Excel files, or any other non-HTML page, is not fully accurate, and that steps have been taken to address this problem by implementing procedures to search items such as academic publications, news, blogs, videos, books, and real-time information. However, Google still provides access to only a fraction of the Deep Web. This chapter explores the Deep Web and the tools currently available for accessing it.
UR - http://www.scopus.com/inward/record.url?scp=84952324244&partnerID=8YFLogxK
U2 - 10.1007/978-1-84882-628-1_9
DO - 10.1007/978-1-84882-628-1_9
M3 - Chapter
AN - SCOPUS:84952324244
T3 - Advanced Information and Knowledge Processing
SP - 167
EP - 181
BT - Advanced Information and Knowledge Processing
PB - Springer-Verlag London Ltd
ER -