Saturday, November 20, 2010

Week 11 Notes

"Web Search Engines: Part 1 and Part 2," by David Hawking
I felt like the information in this article went right over my head; I just was not fully grasping the definitions and concepts of crawling and indexing. Also, the graphs in the figures were not as helpful as I thought they were going to be. What I did gain, if I understood it correctly, is that a good "seed" URL, such as Wikipedia, will link to numerous Web sites, and these "seeds" are what initialize the crawler. After the crawler scans the content of a "seed" URL, it adds any links to other URLs to the queue. Additionally, it saves the Web page content for indexing. Part 2 then goes on to explain indexing algorithms. So, basically, an inverted file, used by the search engines, is a concatenation (the joining of lists end-to-end) of the posting lists for each particular word or phrase, and each posting list contains the ID numbers of all the Web page documents that the word appears in. In the end, I enjoyed Part 2 more than Part 1, but I honestly understood the simpler explanation from Wikipedia better than I did this article.
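
To make the crawling and indexing ideas concrete, here is a rough Python sketch of how a crawler queue and an inverted file might fit together. This is not Hawking's actual algorithm; the fetch_page and extract_links helpers are just placeholders for real HTTP fetching and HTML parsing.

    from collections import defaultdict, deque

    def crawl_and_index(seed_urls, fetch_page, extract_links, max_pages=100):
        """Toy crawler/indexer. fetch_page(url) -> page text and
        extract_links(text) -> list of URLs are stand-ins for real code."""
        frontier = deque(seed_urls)        # queue of URLs waiting to be crawled
        seen = set(seed_urls)
        inverted_file = defaultdict(list)  # word -> posting list of document IDs
        documents = {}                     # document ID -> URL

        while frontier and len(documents) < max_pages:
            url = frontier.popleft()
            text = fetch_page(url)
            doc_id = len(documents)
            documents[doc_id] = url        # save the page content for indexing

            # Indexing: add this document's ID to the posting list of each word it contains.
            for word in set(text.lower().split()):
                inverted_file[word].append(doc_id)

            # Crawling: add any newly discovered links to the queue.
            for link in extract_links(text):
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)

        return inverted_file, documents

The inverted file a search engine stores is essentially the concatenation of all of these posting lists, one per word, so a query can be answered by looking up and combining the lists for its terms.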

"Current Developments and Future Trends for the OAI Protocol for Metadata Harvesting," by Sarah L. Shreeves, et al. 
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) was released in 2001 as a "means to federate access to diverse e-print archives through metadata harvesting and aggregation." Since its release, a wide variety of communities have begun to use the protocol for their own specific needs; a 2003 study counted over 300 active data providers from an array of institutions and domains. The article discusses the use of the protocol within these different communities as well as the challenges and future directions it faces. My favorite part of the article was the three specific examples of communities using the protocol. As a piano player, I was really interested in the Sheet Music Consortium, a collection of freely available digitized sheet music. I am definitely intrigued to research more about it and to see how the search service has progressed.
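
For a sense of how the harvesting works in practice: OAI-PMH repositories answer a small set of verbs (such as ListRecords and GetRecord) sent as ordinary HTTP requests and return XML. Below is a minimal Python sketch of a single ListRecords request; the base URL is hypothetical, and a real harvester would also page through results using the resumptionToken in each response.

    import urllib.parse
    import urllib.request

    def list_records(base_url, metadata_prefix="oai_dc", oai_set=None):
        """Issue one OAI-PMH ListRecords request and return the raw XML response."""
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if oai_set:
            params["set"] = oai_set  # optional selective harvesting by set
        url = base_url + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as response:
            return response.read().decode("utf-8")

    # Hypothetical repository address, for illustration only:
    # xml = list_records("https://example.org/oai")

A service provider like the Sheet Music Consortium presumably issues requests like this against many repositories and aggregates the returned metadata into one searchable collection.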

"The Deep Web: Surfacing Hidden Value," by Michael K. Bergman
This article was the most fascinating to me this week. I never knew there was a "deep Web" and that what we mostly view is just the "surface Web." I was captivated by the idea that additional content is stored on the Web but can only be reached by a direct request, such as a query typed into a site's search form, rather than by following links. It made me wonder how more of the deep Web content could be brought to the surface Web. I also enjoyed the study performed by BrightPlanet, in which they used their own search technology to quantify the size and importance of deep Web material. I was most surprised by the finding that the deep Web is 400 to 550 times larger than the WWW (the surface Web). The finding that the "total quality content of the deep Web is 1,000 to 2,000 times greater than that of the surface Web" was also surprising.
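
As I understand it, the "direct request" Bergman describes is a query sent to a site's own database, something like the hypothetical form submission sketched below. Because the result page is generated on demand, there is usually no static link pointing to it, so a link-following crawler like the one sketched earlier never finds it. The URLs and endpoint here are made up for illustration.

    import urllib.parse
    import urllib.request

    # Surface Web: a static page reachable by following links from a seed URL.
    static_page = "https://example.org/articles/deep-web-overview.html"

    # Deep Web: the same site's database only answers a direct query,
    # so the resulting page has no incoming link for a crawler to follow.
    def query_database(search_term):
        """Hypothetical form submission against an example search endpoint."""
        params = urllib.parse.urlencode({"q": search_term})
        url = "https://example.org/search?" + params
        with urllib.request.urlopen(url) as response:
            return response.read().decode("utf-8")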

2 comments:

  1. I liked the Bergman article the best also! The deep web idea is fascinating. I also had the same reaction to the idea of the Sheet Music Consortium. I would also be interested to learn more about that project; it's just amazing how the web can benefit so many different purposes and communities!

  2. Hi Felicia,

Articles this technical really should include a definitions section. The Online Dictionary for Library and Information Science might help. http://lu.com/odlis/index.cfm
