Saturday, November 14, 2009

Week 10 Readings

These readings were on a topic that I'm not overly familiar with, so I learned a lot. I knew the basics behind search engines, but not many details about how they actually work.

1.) Hawking - Web Search Engines

Part 1.
Search engines cannot and should not attempt to index every page on the world wide web. To be cost-effective they must reject low value automated content. Search engines require large scale replication to handle the large input. Currently search engines crawl/index approximately 400 Terabytes of data. Crawling at 10GB/s it would take 10 days to do a full crawl of this much information. Crawlers use "seeds" to begin their search, and then look through the link for URL's they currently haven't indexed.

Crawlers must address many issues. Speed: it would take too long for each crawler to individually crawl every website, so each one only crawls the URLs it is responsible for. Politeness: only one crawler visits a given site at a time, so it doesn't overload the site's servers. Excluded content: crawlers must respect the site's robots.txt file, which specifies whether or not they are allowed to crawl it. Duplicate content: it is simple for a search engine to see whether two pages contain the same text, but distinguishing between different URLs, dates, etc. requires more sophisticated measures. Continuous crawling: how often a website is crawled and indexed is determined by numerous factors, not just static measures. Spam regulation: websites can artificially inflate their status by adding links pointing to themselves so crawlers rank them more highly.
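
The robots.txt check, at least, is easy to illustrate with Python's standard library (just a sketch; real crawlers also throttle how often they hit each host):

```python
from urllib import robotparser
from urllib.parse import urlparse

def allowed_to_crawl(url, user_agent="MyCrawler"):
    """Check a site's robots.txt before fetching one of its pages."""
    parts = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # download and parse the site's robots.txt
    return rp.can_fetch(user_agent, url)
```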

Part 2.
An indexer creates an inverted file in two phases: scanning and inversion. The internet's vocabulary is very large; it contains documents in every language as well as made-up words such as acronyms, trademarks, and proper names. Search engines can use compression to reduce demands on disk space and memory. A link popularity score is assigned to a page based on the frequency of its incoming links.
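
A toy version of the scan-and-invert process might look like this (no compression or positional data, just a mapping from each term to the documents that contain it):

```python
from collections import defaultdict

def build_inverted_index(documents):
    """documents: dict of doc_id -> text.
    Returns term -> sorted list of doc_ids containing that term."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():   # scanning phase
        for term in text.lower().split():
            postings[term].add(doc_id)        # inversion phase
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {1: "web search engines", 2: "search the deep web"}
index = build_inverted_index(docs)
# index["search"] == [1, 2], index["deep"] == [2]
```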

The most common query to a search engine is a small number of words without operators. By default, search engines return only documents containing all the query words. Result quality can be improved if the query processor scans to the end of the postings lists and then sorts the long list of matches with a relevance-scoring function. MSN Search reportedly takes into account over 300 ranking factors when sorting its lists.
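
The default all-words-must-match behavior amounts to intersecting the postings lists for the query words; a minimal sketch (ranking left out entirely):

```python
def and_query(index, query):
    """Return the doc ids whose documents contain every query word."""
    postings = [set(index.get(word, [])) for word in query.lower().split()]
    return set.intersection(*postings) if postings else set()

toy_index = {"web": [1, 2], "search": [1, 3], "deep": [2]}
print(and_query(toy_index, "deep web"))   # {2}
```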

Search engines also have techniques for speeding up query processing and how quickly results are displayed. They often skip hundreds or thousands of documents in a postings list to get to the required documents, and they may stop processing after scanning only a small fraction of the lists. Search engines also cache results, precomputing and storing results pages for thousands of the most popular queries.
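
Caching can be pictured as a simple lookup keyed on the query string (the run_query function and the size limit here are placeholders; real engines are far more sophisticated about what they cache and when they evict it):

```python
class QueryCache:
    """Keep the results pages of the most popular queries in memory."""
    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.cache = {}

    def get_results(self, query, run_query):
        key = query.lower().strip()
        if key in self.cache:        # cache hit: skip list processing entirely
            return self.cache[key]
        results = run_query(key)     # cache miss: evaluate the query
        if len(self.cache) < self.capacity:
            self.cache[key] = results
        return results
```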

2.) Shreeves - OAI Protocol

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) has been widely adopted since its initial release in 2001, and there are now over 300 active data providers. The OAI's mission is "to develop and promote interoperability standards that aim to facilitate the efficient dissemination of content." The protocol divides the world into repositories (data providers), which make metadata available, and harvesters (service providers), which collect that metadata. No single provider can serve the entire needs of the public, so specialized ones have emerged.
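
A harvester is essentially an HTTP client issuing OAI-PMH requests against a repository's base URL. A minimal sketch (the base URL in the usage comment is a placeholder, and resumption tokens and error handling are omitted):

```python
from urllib.request import urlopen
from urllib.parse import urlencode

def list_records(base_url, metadata_prefix="oai_dc"):
    """Issue an OAI-PMH ListRecords request and return the raw XML response."""
    params = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    with urlopen(f"{base_url}?{params}") as response:
        return response.read().decode("utf-8")

# xml = list_records("https://example.org/oai")   # placeholder base URL
```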

Many OAI registries suffer from a number of shortcomings: typically they have no search capability, offer only limited browsing, and are incomplete. The UIUC research group is trying to address these problems by enhancing collection-level descriptions to enable better searching. OAI also faces some ongoing challenges: the metadata itself, problems with data provider implementations, and a lack of communication between service providers and data providers.

3.) Bergman - Deep Web

This article was a tad too technical and scientific for me, but it was still an extremely good read. I didn't necessarily enjoy (or understand) all the technical data and research detail, but I really liked the concepts and theories behind it. The main point of the article was something I had never really explored or read about in depth. Overall, it was one of my favorite reads of the semester, purely for the concepts it discussed.

The article's main concept is that most of the web's information is buried deep where search engines are unable to find it. The deep web is very different from the surface web: it consists largely of searchable databases that must be queried one at a time, as opposed to pages that search engines can crawl. BrightPlanet tried to quantify the deep web, and their statistics blew me away. I knew that a lot of information was "hidden" within the web, but I did not realize the extent of it. The deep web is 400 to 550 times larger than the surface web (roughly 7,500 TB of information compared to 19 TB). It holds about 550 billion documents, as opposed to the surface web's 1 billion, and about 95% of it is publicly accessible for free.
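
Those figures are at least roughly consistent with one another; a quick back-of-the-envelope check:

```python
size_ratio = 7500 / 19       # terabytes: deep web vs. surface web
doc_ratio = 550e9 / 1e9      # documents: deep web vs. surface web
print(round(size_ratio), round(doc_ratio))   # ~395 and 550, in line with "400 to 550 times larger"
```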

The internet is more diversified than most people realize, and the deep web is a major reason for that. The web is really only a portion of the Internet. The deep web covers a broad and relevant range of topics, and the information it holds is of very high quality and is growing much faster than the surface web.
