Structured Text Retrieval & Challenges

Structured Text Collection

Challenge 1 (Information/Doc Unit): What is an appropriate information unit?

–  A document may no longer be the most natural unit
–  Components within a document, or a whole Web site, may be more appropriate

Challenge 2 (Query): What is an appropriate query language?

–  A keyword (free-text) query is no longer the only choice
–  Constraints on structure can also be posed (e.g., searching within a particular field or matching a URL domain)
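
To make the idea of a query with structural constraints concrete, here is a minimal sketch in Python. The names Doc, StructuredQuery, and matches are hypothetical and chosen only for illustration; they do not correspond to any particular query language or search engine. The sketch assumes a document is a URL plus a set of named text fields.

```python
from dataclasses import dataclass, field
from typing import Optional
from urllib.parse import urlparse


@dataclass
class Doc:
    # Hypothetical document model: a URL plus named text fields.
    url: str
    fields: dict


@dataclass
class StructuredQuery:
    # Free-text terms that may appear in any field.
    keywords: list
    # Structural constraint: field name -> term that must occur in that field.
    field_terms: dict = field(default_factory=dict)
    # Structural constraint: the document's host must fall under this domain.
    domain: Optional[str] = None


def matches(doc: Doc, query: StructuredQuery) -> bool:
    """Return True if doc satisfies the keywords and every structural constraint."""
    host = urlparse(doc.url).hostname or ""
    if query.domain is not None:
        in_domain = host == query.domain or host.endswith("." + query.domain)
        if not in_domain:
            return False
    # Field constraints: the term must occur as a word in the named field.
    for name, term in query.field_terms.items():
        if term.lower() not in doc.fields.get(name, "").lower().split():
            return False
    # Keyword part: every keyword must occur somewhere in the document.
    all_words = " ".join(doc.fields.values()).lower().split()
    return all(k.lower() in all_words for k in query.keywords)
```

In this sketch a plain keyword query is simply the special case with no field or domain constraints, which is why the two kinds of query can share one representation.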

Challenge 3 (Retrieval Model): What is an appropriate model for structured text retrieval (STR)?

–  Ranking vs. selection
  • A Boolean (selection) model may be more powerful when structure is available (e.g., domain constraints in Web search)
  • But ranking may still be desirable
–  Multiple (Potentially Conflicting) Preferences
  • Text-based ranking preferences (traditional IR)
  • Structure-based ranking preferences (e.g., PageRank)
  • Structure-based constraints (traditional DB)
  • Text-based constraints (e.g., )
  • How to combine them?
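
One simple way to combine these preferences, sketched below in Python, is to apply constraints as a Boolean filter (selection) and then rank the survivors by a weighted sum of a text score and a structure score. This is only an illustrative assumption, not a recommended answer to the question above: the function names, the term-frequency text score, the precomputed "link_score" value (standing in for something like PageRank), and the linear weighting are all hypothetical choices.

```python
def text_score(query_terms, text):
    """Toy text-based score: fraction of words in text that are query terms."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(words.count(t.lower()) for t in query_terms)
    return hits / len(words)


def search(docs, query_terms, constraint, weight=0.5):
    """Select documents with a Boolean constraint, then rank the survivors.

    docs: list of dicts with "id", "text", and "link_score" keys (assumed layout).
    constraint: function doc -> bool, the selection step.
    weight: share given to the text score; the remainder goes to link_score.
    """
    # Selection: documents failing the constraint are dropped outright.
    selected = [d for d in docs if constraint(d)]
    # Ranking: linear interpolation of text-based and structure-based scores.
    scored = [
        (weight * text_score(query_terms, d["text"])
         + (1 - weight) * d["link_score"], d["id"])
        for d in selected
    ]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored]
```

Changing weight shifts the ranking between the text-based and structure-based preferences, which shows how the two can conflict: a document with a strong text match but a low link score can move up or down depending on the weighting.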