CSE 538 : Web Search and Mining

Course News

Lectures

Lecture	Contents & Refs	Papers
Week 1	INTRODUCTION (PPT) Web Analytics (Wikipedia) Web Mining (Wikipedia) Raymond Kosala and Hendrik Blockeel, Web mining research: a survey (PDF) Information Retrieval - Wikipedia Web Search Engine - Wikipedia History of Search Engines - From 1945 to Google 2007 WEB TECHNOLOGIES (PPT) (Baldi) An Overview of TCP/IP Protocols and the Internet HTTP1 (PDF), HTTP2 (PDF) HTTP Made Really Easy http://www.w3schools.com/	Brin, Sergey, and Lawrence Page. "The anatomy of a large-scale hypertextual Web search engine." Computer Networks and ISDN Systems 30, no. 1 (1998): 107-117. (HTML) Singhal, Amit. "Modern information retrieval: A brief overview." IEEE Data Eng. Bull. 24, no. 4 (2001): 35-43. (PDF) Broder, Andrei. "A taxonomy of web search." In ACM SIGIR Forum, vol. 36, no. 2, pp. 3-10. ACM, 2002. (PDF)
Week 2	WEB CRAWLING (PPT) (Ch8-Bing Liu) Web Crawling and Basic Text Analysis (PPT) by Hongning Wang IIR Ch20 (PDF) Open Source Search Engines in Java - http://java-source.net/open-source/search-engines - http://www.manageability.org/blog/stuff/open-source-web-crawlers-java Start with Nutch - http://nutch.apache.org/ Index directly to Solr - http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/ Create a seed list from DMOZ RDF http://www.dmoz.org/rdf.html http://wiki.apache.org/nutch/NutchTutorial Entity Extraction - LingPipe http://alias-i.com/lingpipe/ - OpenNLP http://incubator.apache.org/opennlp/ Entity Identification / Taxonomies - Freebase http://www.freebase.com/ Basic Web Page Parser - https://github.com/pjaol/Webcrawler Example of OpenNLP usage - https://github.com/pjaol/entity_extractor Wikipedia: http://en.wikipedia.org/wiki/Web_crawler	Olston, Christopher, and Marc Najork. "Web crawling." Foundations and Trends in Information Retrieval 4, no. 3 (2010): 175-246. (PDF) Abiteboul, Serge, Mihai Preda, and Gregory Cobena. "Adaptive on-line page importance computation." In Proceedings of the 12th international conference on World Wide Web, pp. 280-290. ACM, 2003. (PDF) Rendle, Steffen, Christoph Freudenthaler, and Lars Schmidt-Thieme. "Factorizing personalized markov chains for next-basket recommendation." In Proceedings of the 19th international conference on World Wide Web, pp. 811-820. ACM, 2010. (PDF) Shkapenyuk, Vladislav, and Torsten Suel. "Design and implementation of a high-performance distributed web crawler." In Data Engineering, 2002. Proceedings. 18th International Conference on, pp. 357-368. IEEE, 2002. (PDF) Chakrabarti, Soumen, Byron Dom, Prabhakar Raghavan, Sridhar Rajagopalan, David Gibson, and Jon Kleinberg. "Automatic resource compilation by analyzing hyperlink structure and associated text." Computer Networks and ISDN Systems 30, no. 1 (1998): 65-74. (HTML) Hull, David A. "Stemming algorithms: A case study for detailed evaluation." JASIS 47, no. 1 (1996): 70-84. (PDF) Xu, Jinxi, and W. Bruce Croft. "Corpus-based stemming using cooccurrence of word variants." ACM Transactions on Information Systems (TOIS) 16, no. 1 (1998): 61-81. (PDF)
Week 3	BOOLEAN MODEL (PPT) - IIR Ch. 1 - Shakespeare plays TERMS AND POSTINGS (PPT) - IIR Ch. 2	- http://zembereknlp.blogspot.com.tr/ - Porter's stemmer (MIR), Porter stemming algorithm (Official) - A skip list cookbook (Pugh 1990) - Fast phrase querying with combined indexes (Williams, Zobel, Bahle 2004) - Efficient phrase querying with an auxiliary index (Bahle, Williams, Zobel 2002)
Week 4	DICTIONARIES AND TOLERANT RETRIEVAL (PPT) - IIR Ch. 3	- Techniques for automatically correcting words in text (Kukich 1992) - Finding approximate matches in large lexicons (Zobel and Dart 1995) - Efficient Generation and Ranking of Spelling Error Corrections (Tillenius) - How to write a spelling corrector (Peter Norvig)
Week 5	INDEX CONSTRUCTION (PPT) - IIR Ch. 4 INDEX COMPRESSION (PPT) - IIR Ch. 5	- MapReduce: simplified data processing on large clusters (Dean and Ghemawat 2004) - Efficient single-pass index construction for text databases (Heinz and Zobel 2003) - Compression of inverted indexes for fast query evaluation (Scholer et al. 2002) - Inverted index compression using word-aligned binary codes (Anh and Moffat 2005) Zobel, Justin, and Alistair Moffat. "Inverted files for text search engines." ACM Computing Surveys (CSUR) 38, no. 2 (2006): 6. (PDF) Scholer, Falk, Hugh E. Williams, John Yiannis, and Justin Zobel. "Compression of inverted indexes for fast query evaluation." In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 222-229. ACM, 2002. (PDF) Yan, Hao, Shuai Ding, and Torsten Suel. "Inverted index compression and query processing with optimized document ordering." In Proceedings of the 18th international conference on World Wide Web, pp. 401-410. ACM, 2009. (PDF)
Week 6	SCORING, TERM WEIGHTING AND THE VECTOR SPACE MODEL (PPT) IIR 6.2 - 6.4.3 IR Models from Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition (PDF)	- Cosine Similarity - Exploring the similarity space - Okapi BM25 Salton, Gerard, and Christopher Buckley. "Term-weighting approaches in automatic text retrieval." Information Processing & Management 24, no. 5 (1988): 513-523. (PDF) Raghavan, Vijay V., and SK Michael Wong. "A critical analysis of vector space model for information retrieval." Journal of the American Society for Information Science 37, no. 5 (1986): 279-287. (PDF) Singhal, Amit, Chris Buckley, and Mandar Mitra. "Pivoted document length normalization." In Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 21-29. ACM, 1996. (PDF) Turney, Peter D., and Patrick Pantel. "From frequency to meaning: Vector space models of semantics." Journal of Artificial Intelligence Research 37, no. 1 (2010): 141-188. (PDF) Sahlgren, Magnus. "The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces." (2006). (PDF)
Week 7	SCORES IN A COMPLETE SEARCH SYSTEM (PPT) IIR Ch. 7
Week 8	EVALUATION IN INFORMATION RETRIEVAL (PPT) Example (PDF) IIR Ch. 8	Borlund, Pia. "The IIR evaluation model: a framework for evaluation of interactive information retrieval systems." Information Research 8, no. 3 (2003). (PDF) Clarke, Charles LA, Maheedhar Kolla, Gordon V. Cormack, Olga Vechtomova, Azin Ashkan, Stefan Büttcher, and Ian MacKinnon. "Novelty and diversity in information retrieval evaluation." In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pp. 659-666. ACM, 2008. (PDF) Smucker, Mark D., James Allan, and Ben Carterette. "A comparison of statistical significance tests for information retrieval evaluation." In Proceedings of the sixteenth ACM Conference on Information and Knowledge Management, pp. 623-632. ACM, 2007. (PDF) Buckley, Chris, and Ellen M. Voorhees. "Retrieval evaluation with incomplete information." In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 25-32. ACM, 2004. (PDF) Carterette, Ben, James Allan, and Ramesh Sitaraman. "Minimal test collections for retrieval evaluation." In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 268-275. ACM, 2006. (PDF) Common evaluation measures (TREC) Evaluation methods in text categorization The use of MMR, diversity-based reranking for reordering documents and producing summaries (Carbonell and Goldstein 1998)
Week 9	RELEVANCE FEEDBACK AND QUERY EXPANSION (PPT) IIR Ch. 9
Week 10	SOCIAL NETWORK ANALYSIS (PPT) (Ch7-Bing Liu)
Week 11	OPINION MINING AND SENTIMENT ANALYSIS (PPT, PDF, PDF) (Ch11-Bing Liu)

Presentations

Nutch
Lucene / Solr
Tika

Homework

Hw1 : Crawler

Projects

Project :

Class Resources

Similar Courses

CS707 Wright (HTML)
CS 572 USC (HTML)
CS 276 Stanford University (HTML)
CS 315 Wellesley (HTML)
CS395T: Concepts of Information Retrieval (and Web Search) - UTexas (HTML)
Information Retrieval and Web Search - UTexas (HTML)
CSCI 572: Information Retrieval and Web Search Engines - USC (HTML)
Information Retrieval and Data Mining - Max Planck (HTML)
Introduction to Information Retrieval (http://www.ims.uni-stuttgart.de/ir/)
CS 6501 Virginia (HTML)

Useful Links

Conferences