I liked this week's articles, and they were on a topic I'm interested in, which made them even better. I particularly liked that it wasn't just a bunch of articles on tagging or folksonomy in general, but rather articles on different aspects of folksonomy and what it can be used for (i.e., library instruction, the academic library, etc.). I did think the articles were pretty basic, though, almost bordering on too basic. The chosen articles could have used a bit more depth, but it was good that they all related to libraries.
1.) Allan, "Using a Wiki"
This article was about how libraries can use a wiki to improve library instruction: sharing information, facilitating collaboration in the creation of resources, and efficiently dividing workloads among different librarians. A wiki is kind of like a shared Word document where anyone can edit the text and attach files.
This article focuses mainly on how wikis can be used to help with library instruction, whether run by the library itself or in a particular class with the help of a professor. Library instruction wikis have two main uses: sharing knowledge and expertise, and cooperating in the creation of resources. Wikis are extremely easy to use and free to create. Once you create one you can invite other users to participate, and they can change the wiki themselves. Wikis are beginning to catch on in many different workplaces.
This was an interesting article, yet somewhat basic. I also think there are many more useful ways to use a wiki than library instruction, though it would certainly be helpful there as well.
2.) Arch, "Creating the Academic Library Folksonomy"
This article is about social tagging and the advantages it offers libraries. Social tagging is a fairly new phenomenon that allows people to create tags for websites and store them online. This could be very useful to libraries: it could help them better support their users' research goals and needs. The article then gives some examples of sites that use tagging, like Delicious.
I had a fairly big problem with this article, especially since it was written in 2007 (fairly recently). The article makes it seem like the only thing tagging is good for is a glorified bookmarking system. It talks about how you can save and tag a website and then retrieve it later on a different computer, much like just bookmarking it to a server so it can be found elsewhere. Social tagging's capabilities go way beyond this; this is only a small part of the advantage of being able to tag things. Tagging allows things to be stored and organized in ways the physical world never allowed. With tagging you can organize a book under a lot of different categories instead of it having to sit in one specific spot on a shelf. The article touched on the advantages of tagging but was way too narrow and did not begin to show the scope of what is possible.
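To make the shelf analogy concrete for myself, here is a tiny sketch (my own illustration, not something from the article) of a tag index in Python: one book can be filed under many tags at once and retrieved through any of them.

```python
# A minimal sketch of why tagging escapes the "one spot on the shelf" limit:
# a single item can be filed under many tags and found through any of them.
from collections import defaultdict

tag_index = defaultdict(set)   # tag -> set of item ids
item_tags = defaultdict(set)   # item id -> set of tags

def tag_item(item, *tags):
    """Attach any number of tags to an item."""
    for tag in tags:
        tag_index[tag].add(item)
        item_tags[item].add(tag)

def items_with(tag):
    """Return every item filed under a given tag."""
    return tag_index[tag]

# One book lives under several "shelves" simultaneously.
tag_item("Moby-Dick", "fiction", "whaling", "american-lit", "19th-century")
print(items_with("whaling"))       # {'Moby-Dick'}
print(items_with("american-lit"))  # {'Moby-Dick'}
```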
3.) Wales, "How a ragtag band created Wikipedia"
This was an interesting video, much like the Google one we watched earlier in the semester, as it explained how Wikipedia was formed, its goals, and what it is currently working on. Wikipedia is a free encyclopedia written by thousands of volunteers. The goal of the organization is to provide access to as much information as possible to as many people as possible.
The thing I liked most about the video is that it addressed the main issues, controversies, and myths people have about Wikipedia. The biggest is that many people believe Wikipedia is unreliable because anybody can contribute to and change an article, especially since people can edit articles anonymously. Wales says this is not as big a problem as people think. The software Wikipedia uses is open-ended and leaves everything up to volunteers; because of this, people police themselves and other users instead of just creating false articles. Wikipedia maintains a neutral point of view, and again, because many people may be working on the same article, this is not as hard to do as people might think (even on political issues they can maintain a neutral point of view). The one thing I thought was very interesting is what Wales called the "Google Test": if the topic of an article does not show up in a Google search, it is probably not worthwhile enough to have an encyclopedia article about it.
Overall, I liked this video, especially because it gave me some background on a website that I use almost daily.
Friday, November 27, 2009
Week 11 Muddiest Point
When doing link analysis, why don't they also look at the outgoing links of a website, instead of just incoming links? It seems like outgoing links would also be a helpful way to analyze a website.
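To spell out what I mean, here is a rough sketch (hypothetical pages and links, not anything from the lecture) that counts both incoming and outgoing links for a tiny link graph; my understanding is that incoming links get favored because a page's author controls its own outgoing links but not who chooses to link in.

```python
# Count incoming vs. outgoing links in a small, made-up link graph.
from collections import defaultdict

links = [                      # (source page, target page)
    ("a.com", "b.com"),
    ("a.com", "c.com"),
    ("b.com", "c.com"),
    ("d.com", "c.com"),
]

in_degree = defaultdict(int)
out_degree = defaultdict(int)
for src, dst in links:
    out_degree[src] += 1       # links the page chooses to make
    in_degree[dst] += 1        # "votes" other pages cast for it

print(dict(in_degree))   # c.com has 3 incoming links
print(dict(out_degree))  # a.com has 2 outgoing links
```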
Friday, November 20, 2009
Week 11 Readings
So it seems I have been way out of order in doing the blogs and readings. I remember Dr. He saying that the order of the content was going to be switched, but when I went to do the readings I completely forgot. I did the readings by their dates in CourseWeb and not in the order I was supposed to. I hope this won't negatively affect my grade, as I have done all of them over the past 3 weeks, just in the wrong order. It seems I should be back on track for week 12; it was only the last 3 weeks that were out of order.
This week's readings were interesting and on a topic I enjoy. It was a little different since I am out of order and we have already discussed this topic in class, but I still thought the readings were good to read, even if they were late.
1.) Mischo, Digital Libraries: Challenges and Influential Work
Effective search and discovery over open and hidden digital resources is still problematic and challenging. There are differences between providing digital access to collections and actually providing digital library services. This is a very good point, and I liked it a lot: simply providing access to a lot of digital collections does not mean you are providing digital library services.
The first significant federal investment in digital library research came in 1994. There has recently been a surge of interest in metasearch, or federated search, among many different people and institutions. The majority of the rest of the article discussed previous and current research being done in digital libraries and the institutions doing it.
2.) Paepcke, et al., Dewey Meets Turing
In 1994 the NSF launched the Digital Library Initiative (DLI), which brought libraries and computer scientists together to work on the project. The invention and growth of the World Wide Web changed many of their initial ideas. The web almost instantly blurred the distinction between the consumers and the producers of information.
One point in the article I found very interesting was that the computer scientists didn't like all the restrictions placed upon them by the publishers. They were not allowed to make all of their work public, because that would then make public all of the materials in it (i.e., the publishers' copyrighted material). This is interesting because it shed light on digital copyright restrictions for people who may not have understood them.
3.) Lynch, Institutional Repositories
Institutional repositories are a new strategy that allows universities to accelerate changes in scholarship and scholarly communication. The author defines an institutional repository as a set of services a university offers to the members of its community for the management and dissemination of digital materials created by the institution and its community members. This includes preservation of materials when needed, organization, and access or distribution. He thinks a good IR should contain materials from both faculty and students, and both research and training materials.
Universities are not doing a good job facilitating new forms of scholarly communication. Faculty are better at creating ideas than at being system administrators and disseminators of their own work. IRs could solve these problems. They address questions of both short-term access and long-term preservation, and have the advantage of being able to maintain data as well.
The author sees some places where IRs can go astray or become counterproductive: if the IR is used by administration to exercise control over what had been faculty-controlled work; if the infrastructure is overloaded with distracting and irrelevant policies; and if the IR is implemented hastily. Just because other universities are implementing new IRs doesn't mean you should rush in and start one yourself. These are points the author believes universities need to look at closely when implementing institutional repositories.
IRs promote progress in infrastructure standards in many different ways; the author gives three examples.
Preservable formats - the things in the IR should be preserved, though different institutions will do this in different ways.
Identifiers - references to materials in IRs will be important in scholarly dialogue and the scholarly record.
Rights documentation and management - managing rights for digital materials will be essential; you need a way to document the rights and permissions of the works.
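Just to picture what those three standards might mean in practice, here is a purely hypothetical sketch of a repository record; the field names and values are my own invention, not anything from Lynch's article.

```python
# Hypothetical repository record touching the three concerns above:
# a preservable format, a persistent identifier, and explicit rights.
record = {
    "identifier": "hdl:12345/6789",   # persistent identifier (e.g. a Handle)
    "title": "Sample Working Paper",
    "creator": "A. Faculty",
    "format": "application/pdf",      # a widely preservable format
    "rights": "Author retains copyright; non-commercial redistribution permitted.",
}

for field, value in record.items():
    print(f"{field}: {value}")
```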
I liked the article on IRs the best. I'm interested in the many advantages of institutional repositories and how best to implement them, and this article was extremely informative.
Week 10 Muddiest Point
I understand that XML is a general markup language and can be used for many things besides building webpages. Why is XML preferred for building webpages specifically, though? HTML seems much less complicated and hence less time consuming to use. Maybe I'm just more familiar with writing HTML as opposed to XML, but XML seems more in depth and takes a lot more effort to produce a webpage.
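To answer part of my own question, here is a quick illustration (with made-up element names) of what you gain for the extra effort: XML tags describe the data itself, so a program can read it directly rather than a browser only rendering it.

```python
# The tags name the data, not its appearance, so the document is machine-readable.
import xml.etree.ElementTree as ET

xml_doc = """
<book>
  <title>Digital Libraries</title>
  <author>Jane Doe</author>
  <year>2009</year>
</book>
"""

root = ET.fromstring(xml_doc)
print(root.find("title").text)   # Digital Libraries
print(root.find("year").text)    # 2009
```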
Saturday, November 14, 2009
Week 10 Readings
These readings were on a topic I'm not overly familiar with, so I learned a lot. I knew the basics behind search engines, but not many details about how they actually work.
1.) Hawking - Web Search Engines
Part 1.
Search engines cannot and should not attempt to index every page on the World Wide Web. To be cost-effective they must reject low-value automated content. Search engines require large-scale replication to handle the load. Currently search engines crawl/index approximately 400 terabytes of data; crawling at 10GB/s, it would take 10 days to do a full crawl of that much information. Crawlers use "seeds" to begin their search, and then scan the fetched pages for links to URLs they haven't indexed yet.
Crawlers must address many issues:
Speed - it would take too long for each crawler to individually crawl every website, so each one only crawls the URLs it is responsible for.
Politeness - only one crawler visits a URL at a time so it doesn't overload the site's servers.
Excluded content - they must respect the robots.txt file that specifies whether or not they may crawl the website.
Duplicate content - it is simple for a search engine to see if two pages have the same text, but distinguishing between different URLs, dates, etc. requires more sophisticated measures.
Continuous crawling - how often a website is crawled/indexed is determined by numerous factors, not just static measures.
Spam regulation - websites can artificially inflate their status by adding links to themselves so crawlers rank them higher.
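Here is a bare-bones sketch of the crawl loop described above (my own simplification, nothing like a production crawler): start from seed URLs, check robots.txt, and queue links that haven't been seen yet.

```python
# Toy crawler: seed URLs, robots.txt check, frontier of unseen links.
# Error handling and politeness delays are omitted.
import urllib.request
import urllib.robotparser
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def allowed(url):
    """Check the site's robots.txt before fetching."""
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch("*", url)

def crawl(seeds, limit=10):
    frontier, seen = list(seeds), set(seeds)
    while frontier and len(seen) <= limit:
        url = frontier.pop(0)
        if not allowed(url):
            continue
        html = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:      # only queue unseen URLs
                seen.add(absolute)
                frontier.append(absolute)
    return seen

# crawl(["https://example.com/"])
```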
Part 2.
An indexer creates an inverted file in two phases: scanning and inversion. The internet's vocabulary is very large - it contains documents in all languages, as well as made-up words such as acronyms, trademarks, proper names, etc. Search engines can use compression to reduce demands on disk space and memory. Search engines also assign each page a link popularity score based on the number of incoming links.
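A toy sketch of the scan-and-invert idea (my own example documents): scan each document for terms, then record which documents each term appears in.

```python
# Build an inverted file: term -> set of documents containing it.
from collections import defaultdict

docs = {
    1: "search engines index the web",
    2: "the deep web is larger than the surface web",
    3: "engines rank pages by incoming links",
}

inverted = defaultdict(set)
for doc_id, text in docs.items():    # scanning phase
    for term in text.lower().split():
        inverted[term].add(doc_id)   # inversion: term points back to documents

print(sorted(inverted["web"]))       # [1, 2]
print(sorted(inverted["engines"]))   # [1, 3]
```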
The most common query to a search engine is a small number of words without operators. By default, search engines return only documents containing all the query words. Result quality can be improved if the query processor scans further down the lists and then sorts the longer candidate list with a relevance scoring function. MSN Search reportedly takes into account over 300 ranking factors when sorting its lists.
Search engines also have techniques for speeding up their searches and how fast results are displayed. They often skip hundreds or thousands of documents to get to the required documents. They also stop processing after scanning only a small fraction of the lists. Search engines cache results as well, precomputing and storing result pages for thousands of the most popular queries.
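Continuing the toy example, here is a sketch of an AND query that returns only documents containing every query term, plus a simple cache of results for repeated popular queries; real engines are obviously far more elaborate.

```python
# Conjunctive (AND) query over a small inverted index, with a result cache.
inverted = {
    "web":     {1, 2},
    "engines": {1, 3},
    "deep":    {2},
}
query_cache = {}

def and_query(query):
    """Return only documents containing every query term, caching the result."""
    key = tuple(sorted(query.lower().split()))
    if key not in query_cache:
        postings = [inverted.get(t, set()) for t in key]
        query_cache[key] = set.intersection(*postings) if postings else set()
    return query_cache[key]

print(and_query("web engines"))   # {1}: the only document with both terms
print(and_query("engines web"))   # same result, served from the cache
```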
2.) Shreeves - OAI Protocol
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) has been widely adopted since its initial release in 2001. There are now over 300 active data providers. The OAI's mission is "to develop and promote interoperability standards that aim to facilitate the efficient dissemination of content." The protocol divides the world into repositories, which make metadata available, and harvesters, which harvest that metadata. No one provider can serve the entire needs of the public, so specialized ones have popped up.
Many OAI registries suffer from a number of shortcomings: typically there is no search function, browsing is limited, and the registries are incomplete. The UIUC research group is trying to address these problems; they are trying to enhance collection-level descriptions to enable better searching. OAI also faces some ongoing challenges: the quality of the metadata, problems with data provider implementations, and a lack of communication between service providers and data providers.
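For my own reference, here is roughly what a harvest request looks like. The verb and metadataPrefix parameters are part of the OAI-PMH protocol itself, but the base URL below is hypothetical, so it would need to point at a real data provider before running.

```python
# Issue an OAI-PMH ListRecords request and print Dublin Core titles.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BASE_URL = "https://repository.example.edu/oai"   # hypothetical data provider

params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
url = BASE_URL + "?" + urllib.parse.urlencode(params)

with urllib.request.urlopen(url) as response:
    tree = ET.parse(response)

# Print the title of each harvested Dublin Core record.
for title in tree.iter("{http://purl.org/dc/elements/1.1/}title"):
    print(title.text)
```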
3.) Bergman - Deep Web
This article was a tad bit too technical/scientific for me, but was still an extremely good article. I didn't necessarily enjoy (or understand) all the technical data and scientific research information in it, but I really liked the concepts and theories behind it. The main point of this article was something I had never really explored or read about in depth. Overall, this article was one of my favorite reads of the semester, purely for the concepts it talked about.
The article's main concept is that most of the web's information is buried deep where search engines are unable to find it. The deep web is very different from the surface web: it consists largely of searchable databases that have to be queried "one at a time," as opposed to websites that search engines can crawl. BrightPlanet tried to quantify the deep web, and their statistics actually blew me away. I knew there was a lot of information "hidden" within the web and Internet, but I did not realize the extent of it. The deep web is 400 to 550 times larger than the surface web (which equates to about 7,500 TB as opposed to 19 TB). The deep web has about 550 billion documents as opposed to the surface web's 1 billion. The deep web is also about 95% accessible to the public for free.
The internet is more diversified than most people realize, and the deep web is the major reason for that. The web is really only a portion of the Internet. The deep web covers a broad and relevant range of topics. The information contained in the deep web is of very high quality and is growing much faster than the surface web.
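Here is a toy simulation (entirely made up) of why this content stays hidden: the records only exist as responses to a query, so a crawler that merely follows links from the homepage never sees them.

```python
# Surface pages are reachable by following links; deep-web records only
# appear when someone submits a query to the site's search form.
database = {
    "whales": ["Cetacean Migration Study", "Sonar and Whale Strandings"],
    "corals": ["Reef Bleaching Survey 2000"],
}

homepage_links = ["/about", "/contact", "/search"]   # static, crawlable pages

def crawl_surface():
    """A link-following crawler only ever sees the static pages."""
    return homepage_links

def query_deep(term):
    """Records surface only in response to a submitted query."""
    return database.get(term, [])

print(crawl_surface())        # no study titles here
print(query_deep("whales"))   # the "hidden" records appear only via a query
```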