CSE 538: Web Search and Mining
 

Course News
Lectures
  

Lecture

Contents & Refs

Papers

 

Week 1

INTRODUCTION (PPT) WEB TECHNOLOGIES (PPT) (Baldi)
  • Brin, Sergey, and Lawrence Page. "The anatomy of a large-scale hypertextual Web search engine." Computer Networks and ISDN Systems 30, no. 1 (1998): 107-117. (HTML)
  • Singhal, Amit. "Modern information retrieval: A brief overview." IEEE Data Eng. Bull. 24, no. 4 (2001): 35-43. (PDF)
  • Broder, Andrei. "A taxonomy of web search." In ACM SIGIR Forum, vol. 36, no. 2, pp. 3-10. ACM, 2002. (PDF)
Week 2 WEB CRAWLING (PPT) (Ch8-Bing Liu)
  • Web Crawling and Basic Text Analysis (PPT) by Hongning Wang
  • IIR Ch20 (PDF)
  • Open Source Search Engines in Java 
    - http://java-source.net/open-source/search-engines 
    - http://www.manageability.org/blog/stuff/open-source-web-crawlers-java
  • Start with Nutch – http://nutch.apache.org/
  • Index directly to SOLR
  • Create a seed list from DMOZ RDF – http://www.dmoz.org/rdf.html
  • http://wiki.apache.org/nutch/NutchTutorial 
  • Entity Extraction
    - LingPipe http://alias-i.com/lingpipe/
    - OpenNLP http://incubator.apache.org/opennlp/
  • Entity Identification / Taxonomies
    - Freebase http://www.freebase.com/
  • Basic Web Page Parser – https://github.com/pjaol/Webcrawler
  • Example of OpenNLP usage – https://github.com/pjaol/entity_extractor
  • Wikipedia: http://en.wikipedia.org/wiki/Web_crawler
    • Olston, Christopher, and Marc Najork. "Web crawling." Foundations and Trends in Information Retrieval 4, no. 3 (2010): 175-246. (PDF)
    • Abiteboul, Serge, Mihai Preda, and Gregory Cobena. "Adaptive on-line page importance computation." In Proceedings of the 12th international conference on World Wide Web, pp. 280-290. ACM, 2003. (PDF)
    • Rendle, Steffen, Christoph Freudenthaler, and Lars Schmidt-Thieme. "Factorizing personalized Markov chains for next-basket recommendation." In Proceedings of the 19th international conference on World Wide Web, pp. 811-820. ACM, 2010. (PDF)
    • Shkapenyuk, Vladislav, and Torsten Suel. "Design and implementation of a high-performance distributed web crawler." In Data Engineering, 2002. Proceedings. 18th International Conference on, pp. 357-368. IEEE, 2002. (PDF)
    • Chakrabarti, Soumen, Byron Dom, Prabhakar Raghavan, Sridhar Rajagopalan, David Gibson, and Jon Kleinberg. "Automatic resource compilation by analyzing hyperlink structure and associated text." Computer Networks and ISDN Systems 30, no. 1 (1998): 65-74. (HTML)
    • Hull, David A. "Stemming algorithms: A case study for detailed evaluation." JASIS 47, no. 1 (1996): 70-84. (PDF)
    • Xu, Jinxi, and W. Bruce Croft. "Corpus-based stemming using cooccurrence of word variants." ACM Transactions on Information Systems (TOIS) 16, no. 1 (1998): 61-81. (PDF)

    Week 3
    BOOLEAN MODEL (PPT)
    - IIR Ch. 1
    - Shakespeare plays
    TERMS AND POSTINGS (PPT)
    - IIR Ch. 2 
    - http://zembereknlp.blogspot.com.tr/

    - Porter's stemmer (MIR), Porter stemming algorithm (Official) 
    - A skip list cookbook (Pugh 1990) 
    - Fast phrase querying with combined indexes (Williams, Zobel, Bahle 2004)
    - Efficient phrase querying with an auxiliary index (Bahle, Williams, Zobel 2002)
    Week 4 DICTIONARIES AND TOLERANT RETRIEVAL (PPT)
    - IIR Ch. 3 
    - Techniques for automatically correcting words in text (Kukich 1992)
    - Finding approximate matches in large lexicons (Zobel and Dart 1995)
    - Efficient Generation and Ranking of Spelling Error Corrections (Tillenius)
    - How to write a spelling corrector (Peter Norvig)
    Week 5 INDEX CONSTRUCTION (PPT)
    - IIR Ch. 4 


    INDEX COMPRESSION (PPT)

    - IIR Ch. 5

    - MapReduce: simplified data processing on large clusters (Dean and Ghemawat 2004)
    - Efficient single-pass index construction for text databases (Heinz and Zobel 2003)
    - Compression of inverted indexes for fast query evaluation (Scholer et al. 2002)
    - Inverted index compression using word-aligned binary codes (Anh and Moffat 2005)

    • Zobel, Justin, and Alistair Moffat. "Inverted files for text search engines." ACM Computing Surveys (CSUR) 38, no. 2 (2006): 6. (PDF)
    • Scholer, Falk, Hugh E. Williams, John Yiannis, and Justin Zobel. "Compression of inverted indexes for fast query evaluation." In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 222-229. ACM, 2002. (PDF)
    • Yan, Hao, Shuai Ding, and Torsten Suel. "Inverted index compression and query processing with optimized document ordering." In Proceedings of the 18th international conference on World Wide Web, pp. 401-410. ACM, 2009. (PDF)
    Week 6 SCORING, TERM WEIGHTING AND THE VECTOR SPACE MODEL (PPT)

    IIR 6.2 - 6.4.3

    IR Models from Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition (PDF)
          - Cosine Similarity 
          - Exploring the similarity space 
          - Okapi BM25

    • Salton, Gerard, and Christopher Buckley. "Term-weighting approaches in automatic text retrieval." Information Processing & Management 24, no. 5 (1988): 513-523. (PDF)
    • Raghavan, Vijay V., and SK Michael Wong. "A critical analysis of vector space model for information retrieval." Journal of the American Society for Information Science 37, no. 5 (1986): 279-287. (PDF)
    • Singhal, Amit, Chris Buckley, and Mandar Mitra. "Pivoted document length normalization." In Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 21-29. ACM, 1996. (PDF)
    • Turney, Peter D., and Patrick Pantel. "From frequency to meaning: Vector space models of semantics." Journal of Artificial Intelligence Research 37, no. 1 (2010): 141-188. (PDF)
    • Sahlgren, Magnus. "The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces." (2006). (PDF)
    Week 7 SCORES IN A COMPLETE SEARCH SYSTEM (PPT)
    IIR Ch. 7
     
    Week 8    
    Week 9    
    Week 10    
    Week 11    
      