CALI, day 3, Building, Stuart Sierra

by Paul Maharg on 21/06/2008

Final session. I’m dropping in on a stream that I wasn’t even aware of until John Joergensen pointed out at lunch that there’s a conference stream on open access to primary legal materials. Stuart Sierra, Columbia, is working on The Next WestLaw Killer, by his own admission. Clearly there are links to international projects such as AustLII and BAILII, but this one is different. Following on from Project Posner (posting the opinions of Posner on the web, around 2,000 of them), they put up other resources, and got into the game that way. They got cases from court web sites, building a bunch of Perl scripts to crawl the pages, and now have about 100,000 cases from the Appeals and Supreme Courts. They also got help from Cornell LII and Justia.

Web server: Apache; application: Rails + AltLaw; database & search: MySQL & Ferret. Ferret was kind of slow, so he now uses Solr, which handles caching, maintains a pool of Lucene threads (searching, indexing), and does tokenization, stemming & synonyms.

He also wanted to do what he calls ‘reverse citations’, ie text with citations embedded in it. Problematic, because it’s a huge self-join between tables. It worked, but was difficult and v e r y slow. So he got involved in the Grand Unified Data Project. He showed how he tried endless permutations of relational dbs, RDF, PostgreSQL (at this point I’m barely hanging onto his coat-tails…). He now uses MapReduce and Hadoop (it works for Yahoo…), and rents a cluster from AWS (Amazon Web Services).
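To make the MapReduce point a little more concrete, here is a minimal sketch of how a citation graph can be inverted in a map step and a reduce step rather than a self-join. It is plain Python, not Hadoop, and not Stuart's code: the case ids, the `CITATIONS` data and the function names are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: case id -> ids of the cases it cites.
CITATIONS = {
    "case_a": ["case_b", "case_c"],
    "case_b": ["case_c"],
    "case_c": [],
}


def map_phase(case_id, cited_ids):
    """Emit one (cited, citing) pair for each citation found in a case."""
    for cited in cited_ids:
        yield cited, case_id


def reduce_phase(pairs):
    """Group the emitted pairs by cited case, as a shuffle/reduce step would."""
    grouped = defaultdict(list)
    for cited, citing in pairs:
        grouped[cited].append(citing)
    return {cited: sorted(citing) for cited, citing in grouped.items()}


def reverse_citations(citations):
    """Invert the citation graph: cited case -> cases that cite it."""
    pairs = (
        pair
        for case_id, cited_ids in citations.items()
        for pair in map_phase(case_id, cited_ids)
    )
    return reduce_phase(pairs)


REVERSE = reverse_citations(CITATIONS)
```

Each case is processed on its own in the map step, so the work can be spread across a cluster, which is presumably the attraction over one enormous join.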

Issues for the project included Medium-Neutral Citation and privacy issues (persons who appear in court cases now appearing on Google…). He carried out a ‘recall test’ against LEXIS, ie would he call up the same number of cases for the same search parameters, and got virtually the same results (when you strip out the courts he still doesn’t cover because he doesn’t have the primary data). Impressive. He modestly put it down to Lucene, but it’s also due to his huge persistence and programming skill.
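A recall test of this kind might be scored roughly as follows. This is only a sketch under my own assumptions: results are taken to be (case id, court) pairs, and the sample data, court names and function names are invented, not taken from the talk.

```python
def restrict_to_courts(results, courts):
    """Keep only the ids of cases decided by courts the collection covers."""
    return {case_id for case_id, court in results if court in courts}


def recall(reference_ids, candidate_ids):
    """Fraction of the reference results that the candidate also returned."""
    reference = set(reference_ids)
    if not reference:
        raise ValueError("reference result set is empty")
    return len(reference & set(candidate_ids)) / len(reference)


# Hypothetical results for one query: (case id, court) pairs.
REFERENCE_RESULTS = [
    ("c1", "appeals"),
    ("c2", "appeals"),
    ("c3", "supreme"),
    ("c4", "district"),  # a court the open collection does not cover yet
]
CANDIDATE_RESULTS = [("c1", "appeals"), ("c2", "appeals"), ("c3", "supreme")]
COVERED_COURTS = {"appeals", "supreme"}

RAW_RECALL = recall(
    [case_id for case_id, _ in REFERENCE_RESULTS],
    [case_id for case_id, _ in CANDIDATE_RESULTS],
)
ADJUSTED_RECALL = recall(
    restrict_to_courts(REFERENCE_RESULTS, COVERED_COURTS),
    restrict_to_courts(CANDIDATE_RESULTS, COVERED_COURTS),
)
```

The adjusted figure is the one that corresponds to stripping out the courts not yet covered before comparing.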

The future includes more cases (district court, state supreme courts), PageRank applied to citations, more research tools, and accepting corrections and commentary from legal scholars. The last is particularly interesting to me. Stuart mentioned linking this to SSRN, but this would be an excellent addition to a simulation environment. Eg it’s possible to add this to a sim client environment, so that there can be rich layers of legal information open to clients. Better than having a thin layer of textual info because it can be drilled down into. A more focused version of an OpenLaw project would be really handy here.
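On the PageRank-for-citations idea mentioned above, a minimal sketch of what that could look like on a tiny citation graph is below. It is a textbook power iteration, not anything shown in the session; the graph, the damping factor of 0.85 and the iteration count are my assumptions.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Rank cases by citations; graph maps each case to the cases it cites."""
    nodes = set(graph) | {c for cited in graph.values() for c in cited}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Cases that cite nothing spread their rank evenly over every case.
        dangling = sum(rank[node] for node in nodes if not graph.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for node, cited in graph.items():
            for target in cited:
                new_rank[target] += damping * rank[node] / len(cited)
        rank = new_rank
    return rank


# Hypothetical toy graph: case id -> ids of the cases it cites.
CITATION_GRAPH = {
    "case_a": ["case_b", "case_c"],
    "case_b": ["case_c"],
    "case_c": [],
}

RANKS = pagerank(CITATION_GRAPH)
MOST_CITED = max(RANKS, key=RANKS.get)
```

In this toy graph the case cited by both of the others comes out on top, which is the intuition: a case matters more when cases that themselves matter cite it.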
