Lecture
|
Contents & Refs
|
Papers |
Week 1
|
INTRODUCTION
(PPT)
WEB TECHNOLOGIES (PPT) (Baldi)
|
-
Brin, Sergey, and Lawrence Page. "The anatomy of a large-scale
hypertextual Web search engine." Computer networks and ISDN systems
30, no. 1 (1998): 107-117. (HTML)
-
Singhal, Amit. "Modern information retrieval: A brief overview."
IEEE Data Eng. Bull. 24, no. 4 (2001): 35-43. (PDF)
-
Broder, Andrei. "A taxonomy of web search." In ACM Sigir forum,
vol. 36, no. 2, pp. 3-10. ACM, 2002. (PDF)
|
Week 2 |
WEB CRAWLING (PPT)
(Ch8-Bing Liu)
Web Crawling and Basic Text Analysis (PPT)
by Hongning Wang
IIR Ch20 (PDF)
Open Source Search Engines in Java
- http://java-source.net/open-source/search-engines
- http://www.manageability.org/blog/stuff/open-source-web-crawlers-java
Start with Nutch – http://nutch.apache.org/
Index directly to SOLR
Create a seed list from DMOZ rdf
http://www.dmoz.org/rdf.html
http://wiki.apache.org/nutch/NutchTutorial
Entity Extraction
–LingPipe http://alias-i.com/lingpipe/
–OpenNLP http://incubator.apache.org/opennlp/
Entity Identification / Taxonomies
–Freebase http://www.freebase.com/
Basic Web Page Parser –https://github.com/pjaol/Webcrawler
Example of OpenNLP usage
–https://github.com/pjaol/entity_extractor
Wikipedia: http://en.wikipedia.org/wiki/Web_crawler
|
-
Olston, Christopher, and Marc Najork. "Web crawling." Foundations
and Trends in Information Retrieval 4, no. 3 (2010): 175-246. (PDF)
-
Abiteboul, Serge, Mihai Preda, and Gregory Cobena. "Adaptive on-line
page importance computation." In Proceedings of the 12th international
conference on World Wide Web, pp. 280-290. ACM, 2003. (PDF)
-
Rendle, Steffen, Christoph Freudenthaler, and Lars Schmidt-Thieme. "Factorizing
personalized Markov chains for next-basket recommendation." In
Proceedings of the 19th international conference on World wide web,
pp. 811-820. ACM, 2010. (PDF)
-
Shkapenyuk, Vladislav, and Torsten Suel. "Design and implementation
of a high-performance distributed web crawler." In Data Engineering,
2002. Proceedings. 18th International Conference on, pp. 357-368. IEEE,
2002. (PDF)
-
Chakrabarti, Soumen, Byron Dom, Prabhakar Raghavan, Sridhar Rajagopalan,
David Gibson, and Jon Kleinberg. "Automatic resource compilation
by analyzing hyperlink structure and associated text." Computer
Networks and ISDN Systems 30, no. 1 (1998): 65-74. (HTML)
-
Hull, David A. "Stemming algorithms: A case study for detailed
evaluation." JASIS 47, no. 1 (1996): 70-84. (PDF)
-
Xu, Jinxi, and W. Bruce Croft. "Corpus-based stemming using cooccurrence
of word variants." ACM Transactions on Information Systems (TOIS)
16, no. 1 (1998): 61-81. (PDF)
|
Week 3 |
BOOLEAN MODEL (PPT)
- IIR Ch. 1
- Shakespeare plays
TERMS AND POSTINGS (PPT)
- IIR Ch. 2
|
- http://zembereknlp.blogspot.com.tr/
- Porter's stemmer (MIR), Porter stemming algorithm (Official)
- A skip list cookbook (Pugh 1990)
- Fast phrase querying with combined indexes (Williams, Zobel, Bahle 2004)
- Efficient phrase querying with an auxiliary index (Bahle, Williams, Zobel 2002) |
Week 4 |
DICTIONARIES AND TOLERANT RETRIEVAL (PPT)
- IIR Ch. 3 |
- Techniques for automatically correcting words in text (Kukich 1992)
- Finding approximate matches in large lexicons (Zobel and Dart 1995)
- Efficient Generation and Ranking of Spelling Error Corrections (Tillenius)
- How to write a spelling corrector (Peter Norvig) |
Week 5 |
INDEX CONSTRUCTION (PPT)
- IIR Ch. 4
INDEX COMPRESSION (PPT)
- IIR Ch. 5
|
- MapReduce: simplified data processing on large clusters (Dean and Ghemawat 2004)
- Efficient single-pass index construction for text databases (Heinz and Zobel 2003)
- Compression of inverted indexes for fast query evaluation (Scholer et al. 2002)
- Inverted index compression using word-aligned binary codes (Anh and Moffat 2005)
-
Zobel, Justin, and Alistair Moffat. "Inverted files for text search
engines." ACM computing surveys (CSUR) 38, no. 2 (2006): 6. (PDF)
-
Scholer, Falk, Hugh E. Williams, John Yiannis, and Justin Zobel.
"Compression of inverted indexes for fast query evaluation." In
Proceedings of the 25th annual international ACM SIGIR conference on
Research and development in information retrieval, pp. 222-229. ACM,
2002. (PDF)
-
Yan, Hao, Shuai Ding, and Torsten Suel. "Inverted index compression
and query processing with optimized document ordering." In
Proceedings of the 18th international conference on World wide web,
pp. 401-410. ACM, 2009. (PDF)
|
Week 6 |
SCORING, TERM WEIGHTING AND THE VECTOR SPACE MODEL (PPT)
IIR 6.2 - 6.4.3
IR Models from Chap 03: Modeling, Baeza-Yates & Ribeiro-Neto, Modern
Information Retrieval, 2nd Edition (PDF) |
-
Cosine Similarity
-
Exploring the similarity space
-
Okapi BM25
-
Salton, Gerard, and Christopher Buckley. "Term-weighting approaches
in automatic text retrieval." Information processing & management
24, no. 5 (1988): 513-523. (PDF)
-
Raghavan, Vijay V., and SK Michael Wong. "A critical analysis of
vector space model for information retrieval." Journal of the
American Society for Information Science 37, no. 5 (1986): 279-287. (PDF)
-
Singhal, Amit, Chris Buckley, and Mandar Mitra. "Pivoted document
length normalization." In Proceedings of the 19th annual
international ACM SIGIR conference on Research and development in
information retrieval, pp. 21-29. ACM, 1996. (PDF)
-
Turney, Peter D., and Patrick Pantel. "From frequency to meaning:
Vector space models of semantics." Journal of artificial
intelligence research 37, no. 1 (2010): 141-188. (PDF)
-
Sahlgren, Magnus. "The Word-Space Model: Using distributional
analysis to represent syntagmatic and paradigmatic relations between
words in high-dimensional vector spaces." (2006). (PDF)
|
Week 7 |
SCORES IN A COMPLETE SEARCH SYSTEM (PPT)
IIR Ch. 7 |
|
Week 8 |
EVALUATION IN INFORMATION RETRIEVAL (PPT)
Example (PDF)
IIR Ch. 8
|
-
Borlund, Pia. "The IIR evaluation model: a framework for evaluation
of interactive information retrieval systems." Information research
8, no. 3 (2003). (PDF)
-
Clarke, Charles LA, Maheedhar Kolla, Gordon V. Cormack, Olga
Vechtomova, Azin Ashkan, Stefan Büttcher, and Ian MacKinnon.
"Novelty and diversity in information retrieval evaluation." In
Proceedings of the 31st annual international ACM SIGIR conference on
Research and development in information retrieval, pp. 659-666. ACM,
2008. (PDF)
-
Smucker, Mark D., James Allan, and Ben Carterette. "A comparison of
statistical significance tests for information retrieval
evaluation." In Proceedings of the sixteenth ACM conference on
information and knowledge management, pp. 623-632.
ACM, 2007. (PDF)
-
Buckley, Chris, and Ellen M. Voorhees. "Retrieval evaluation with
incomplete information." In Proceedings of the 27th annual
international ACM SIGIR conference on Research and development in
information retrieval, pp. 25-32. ACM, 2004. (PDF)
-
Carterette, Ben, James Allan, and Ramesh Sitaraman. "Minimal test
collections for retrieval evaluation." In Proceedings of the 29th
annual international ACM SIGIR conference on Research and
development in information retrieval, pp. 268-275. ACM, 2006. (PDF)
- Common evaluation measures (TREC)
- Evaluation methods in text categorization
- The use of MMR, diversity-based reranking for reordering documents and producing summaries (Carbonell and Goldstein 1998) |
Week 9 |
RELEVANCE FEEDBACK AND QUERY EXPANSION (PPT)
IIR Ch. 9 |
|
Week 10 |
SOCIAL NETWORK ANALYSIS (PPT)
(Ch7-Bing Liu) |
|
Week 11 |
OPINION MINING AND SENTIMENT ANALYSIS (PPT, PDF, PDF)
(Ch11-Bing Liu) |
|