Topic MatulyasQuestions (r1.1 - 18 Nov 2003 - RjHonicky)


Data Collection

  • Prof. Brewer has a trace of web traffic from the router here at Berkeley: it is old
  • We might want to build a corpus to test library-style access: do this by playing back traces

Data Interpretation

  • Lucene can’t handle anything except plain text, so if we go ahead with the corpus idea, we will need content handlers for each document type.
    RJ: it handles PDF and HTML too
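The content-handler idea above can be sketched as a small dispatch table keyed on content type. This is only an illustration under stated assumptions: the names `HANDLERS`, `extract_text`, `html_to_text`, and `plain_to_text` are hypothetical, only plain text and HTML are covered, and a PDF handler would need an external library that is not shown here.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only visible, non-empty text.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(raw: bytes) -> str:
    parser = _TextExtractor()
    parser.feed(raw.decode("utf-8", errors="replace"))
    parser.close()
    return " ".join(parser.chunks)


def plain_to_text(raw: bytes) -> str:
    return raw.decode("utf-8", errors="replace")


# Hypothetical registry: MIME type -> function returning indexable text.
HANDLERS = {
    "text/plain": plain_to_text,
    "text/html": html_to_text,
}


def extract_text(content_type: str, raw: bytes):
    """Return plain text for the indexer, or None if no handler is registered."""
    # Drop parameters such as "; charset=utf-8".
    mime = content_type.split(";", 1)[0].strip().lower()
    handler = HANDLERS.get(mime)
    return handler(raw) if handler else None
```

Documents for which `extract_text` returns None would simply be cached without being indexed until a handler for their type is added.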

Architecture

  • Proxy to the Internet
  • Data distribution mechanism
  • Indexing mechanism
  • Searching mechanism
  • Proxy to the user interface
  • User interface

Issues

  • Proxy to the Internet
    • No issues for us. We will be using Smart Cache as Fred suggested. However, it probably does more than we want, so we need to be careful.
      Fred, Smart Cache looks fine, but the code is very poorly written, and most comments are in Czech. How wedded to Smart Cache are you?
  • Data Distribution Mechanism
    • Redundancy: We can postpone this one for now. For now, no redundancy: it is a cache, not a data source
    • Which machine stores the data? The machine with long-term storage may be different from the machine that requests the data. Hash the URL
    • Do we hash on keywords? In this case, how do we determine the keywords? Do we use a subset of the index that Lucene generates for the document? No, hash on the URL
    • Does each machine cache independently? There is no overlap in the caches: the set of documents cached by each server is disjoint.
  • Indexing mechanism
    • We use Lucene for now
  • Searching mechanism
    • We use Lucene’s search for searching on an individual machine
  • Proxy to the user interface
    • Does it contact the indexing interface or the searching mechanism? Both
    • Depending upon whether we hash keywords or allow every machine to cache what it wants or something else, this proxy would need to one, all or a certain number of the machines part of this infrastructure. This sentence doesn’t make sense, but I assume you are asking whether each machine has its own proxy: yes
    • How do we merge at the end machine? Do we get an equal number of hits from each cache, do we get all the hits, or do we get more hits from a cache that has more relevant data and fewer from one that has less? Lucene handles this
  • User Interface
    • I would say that this is not a big issue, at least for the time being. Indeed, but it must still be done. How much time will it take?
    • Do we need to integrate with a web browser? No
    • Should we differentiate between offline and online content? This raises the question: how do we know what the online content is supposed to be? Do we cache results of online searches (and some documents that we fetched as opposed to not caching at all or caching all the documents that the online search returned)?
      Fred is doing this right now
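For the question of merging hits at the end machine, the notes say Lucene handles it. Purely to illustrate what such a merge involves, here is a minimal sketch that is not Lucene's API: it assumes each machine returns its hits as `(score, url)` pairs already sorted by descending score, and that scores are comparable across machines, which may not hold for independently built indexes. The name `merge_hits` is hypothetical.

```python
import heapq
from itertools import islice


def merge_hits(per_machine_hits, limit):
    """Merge per-machine result lists into one top-`limit` list.

    per_machine_hits: list of lists of (score, url), each sorted by
    descending score. Because the caches are disjoint, no URL appears
    twice and no de-duplication is needed.
    """
    merged = heapq.merge(*per_machine_hits, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, limit))
```

With this approach a machine holding more relevant documents naturally contributes more of the final hits, rather than each machine contributing an equal share.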

  1. For the hash algorithm, we should use the one I just published: I’ll send it
  2. For cache replacement, LRU
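The two decisions above (hash the URL to pick the storing machine, LRU for replacement) can be sketched together. This is a sketch under assumptions, not the published hash algorithm mentioned in item 1: it substitutes a plain SHA-1 modulo scheme as a stand-in, and the names `node_for_url` and `LRUCache` are hypothetical.

```python
import hashlib
from collections import OrderedDict


def node_for_url(url: str, nodes: list) -> str:
    """Pick the one machine responsible for caching `url`.

    Stand-in scheme: SHA-1 of the URL, modulo the number of machines.
    Every URL maps to exactly one machine, so the caches stay disjoint.
    """
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


class LRUCache:
    """Per-machine document cache with least-recently-used eviction."""

    def __init__(self, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first

    def get(self, url):
        if url not in self._items:
            return None
        self._items.move_to_end(url)  # mark as most recently used
        return self._items[url]

    def put(self, url, document):
        self._items[url] = document
        self._items.move_to_end(url)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used

    def __contains__(self, url):
        return url in self._items
```

One known limitation of the modulo stand-in is that adding or removing a machine remaps most URLs to a different machine, which is one reason a purpose-built placement algorithm would be preferable.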

-- RjHonicky - 18 Nov 2003

