Pippilongstrings/20100121/protocol
Pippilongstrings and beyond - parsing thoughts and ideas in a common bucket.
Second meeting 2010.01.21 @ irc.telecomix.org #pippi
Log:
* http://euwiki.org/Pippilongstrings/20100121
State of play:
* server: callisto, 4 CPUs
* cost: 300€ per month until April 2010
* results: Cariforum, Canada and South-Korea agreement diffs
* code: http://github.com/stef/le-n-x
* corpus pippied: TAs = 3200[0-9]L* and http://euwiki.org/Pippilongstrings/selection
* corpus to be pippied: "Directory of Community legislation in force" http://eur-lex.europa.eu/en/legis/20100101/index.htm
Local targets:
* measure verbatim implementation rate EU-laws / MS laws
* longstring ACTA drafts ASAP
Needz:
* server and co-location
* define the information flow through the system
* BeautifulSoup for cleaning input docs not in CELEX
* long-term wild-ass-visions
* direct access to EU doc databases (Official Journal on a couple of DVDs)
* harvest and build corpus. also try to locate other potentially interesting public document repositories: the US, national with EU, UN, OECD, etc.
- First meeting notes below ******
* More funny names, like "Herr Nilsson's diary" and "Captain Efraim's treasures" :-)
* More long URLs like http://euwiki.org/Performers_the_exclusive_right_to_authorise_or_prohibit_the_broadcasting_by_wireless_means_and_the_communication_to_the_public_of_their_performances,_except_where_the_performance_is_itself_already_a_broadcast_performance_or_is_made_from_a_fixation
Would it also be interesting to identify common large strings with varying sub-parts (eg directive numbers)?
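One cheap way to try this out is to mask the varying sub-parts before comparing texts, so that two sentences differing only in a directive number become identical strings. The sketch below is only an illustration: the `REFERENCE` pattern and the `<REF>` placeholder are assumptions made up for this example, and real EU texts use more reference formats than the pattern covers.

```python
import re

# Hypothetical pattern for references such as "2001/29/EC" or "1049/2001".
# This is an assumption for illustration, not a complete grammar of EU citations.
REFERENCE = re.compile(r"\b\d+/\d+(?:/[A-Z]+)?\b")


def mask_references(text, placeholder="<REF>"):
    """Replace number-style references with a fixed placeholder."""
    return REFERENCE.sub(placeholder, text)


a = "as provided for in Directive 2001/29/EC of the European Parliament"
b = "as provided for in Directive 2006/115/EC of the European Parliament"

# After masking, both sentences are the same string and can be matched
# by an ordinary longstring search.
print(mask_references(a))
print(mask_references(a) == mask_references(b))
```

The masked text would be used only for matching; the original text stays in the corpus so the actual directive numbers can still be shown.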
System architecture
here's what: i need to get a sense of a system architecture to be able to contribute properly.
I don't know what "system architecture" would mean in the context of a wiki-based application, but I think iteration is necessary to grow the corpus and the number of longstrings identified.
Multi-word terms
I can provide a code snippet which will extract multi-word sequences from text. Not just longest strings but sequences of at least N words with a frequency of at least F. Not too exciting code, but it actually provides a surprisingly good sense of what is going on in the text. It's in Perl or in Java: I have both.
allowing varying subparts certainly is a reasonable and not at all complex idea.
also, like i indicated in the mail message, this links up with stuff we do in the lab that has to do with sentence similarity: there are all kinds of funky things we might inject after some experimentation.
Note: AFAIK, that's all needed. However, the search process should be in a collection of documents, not just one text. So, the frequency should be F documents, not F times in the same text (at least if I understood Erik correctly).
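To make the N and F parameters concrete, here is a minimal Python sketch of the idea with frequency counted in documents, as the note suggests. It is not the Perl or Java snippet mentioned above and does not follow the Katz paper; the function names, the `max_len` cap and the whitespace tokenisation are all assumptions for illustration. It enumerates every word n-gram by brute force, so it only suits small corpora.

```python
from collections import defaultdict


def frequent_sequences(docs, min_len=3, max_len=8, min_docs=2):
    """Map each word sequence of min_len..max_len words to the set of
    document ids it occurs in, keeping those found in >= min_docs documents."""
    seen_in = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.lower().split()  # naive tokenisation (assumption)
        for n in range(min_len, max_len + 1):
            for i in range(len(words) - n + 1):
                seen_in[tuple(words[i:i + n])].add(doc_id)
    return {seq: ids for seq, ids in seen_in.items() if len(ids) >= min_docs}


def maximal_only(sequences):
    """Drop sequences contained in a longer one with the same document set."""
    def contains(longer, shorter):
        n = len(shorter)
        return any(longer[i:i + n] == shorter
                   for i in range(len(longer) - n + 1))

    keep = {}
    for seq, ids in sequences.items():
        covered = any(
            len(other) > len(seq) and other_ids == ids and contains(other, seq)
            for other, other_ids in sequences.items()
        )
        if not covered:
            keep[seq] = ids
    return keep


# Toy documents, invented for this example.
docs = {
    "A": "the parties shall ensure that the right holders are protected",
    "B": "each member shall ensure that the right holders are informed",
    "C": "nothing to see here",
}
result = maximal_only(frequent_sequences(docs))
for seq, ids in result.items():
    print(" ".join(seq), sorted(ids))
```

On the toy documents the only maximal shared sequence is "shall ensure that the right holders are", found in A and B.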
Information flow
but first: what other components are there that i should be aware of? do you already have a scraper and an indexer? what sort of information flow should we be modelling for?
- Treaties are in the wiki
- There's tratten-scripts for parsing directives (on the EU's website) and wikifying them. http://github.com/kattla/tratten-scripts
Instead of wikifying directly, directives could first be indexed and then analyzed, storing all longstrings found together with the directives they occur in.
Then, produce the wikified version periodically using as source the index and the information found during the analysis process.
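The flow proposed here (index, then analyse, then render the wiki pages from the stored results) could look roughly like the sketch below. Everything in it is hypothetical: the function names, the in-memory dictionaries standing in for a real index, the fixed n-word window used as a stand-in for a proper longstring search, and the MediaWiki-style output markup.

```python
def build_index(directives):
    """Index step: map each directive id to its list of words."""
    return {doc_id: text.split() for doc_id, text in directives.items()}


def analyse(index, n=5):
    """Analysis step: for each directive, list the n-word strings that also
    occur in at least one other directive, with the ids of those directives."""
    owners = {}
    for doc_id, words in index.items():
        for i in range(len(words) - n + 1):
            owners.setdefault(" ".join(words[i:i + n]), set()).add(doc_id)
    found = {doc_id: [] for doc_id in index}
    for phrase, ids in owners.items():
        if len(ids) > 1:
            for doc_id in ids:
                found[doc_id].append((phrase, sorted(ids - {doc_id})))
    return found


def wikify(doc_id, index, found):
    """Rendering step: build wiki text from the index and analysis results."""
    lines = ["== %s ==" % doc_id, " ".join(index[doc_id]), "",
             "=== Longstrings ==="]
    for phrase, others in sorted(found[doc_id]):
        links = ", ".join("[[%s]]" % other for other in others)
        lines.append('* "%s" also in %s' % (phrase, links))
    return "\n".join(lines)


# Toy input, invented for this example.
directives = {
    "D1": "member states shall provide adequate legal protection against circumvention",
    "D2": "the parties shall provide adequate legal protection for authors",
}
index = build_index(directives)
found = analyse(index, n=4)
print(wikify("D1", index, found))
```

Because the wiki page is generated from the index and the analysis results, it can be regenerated periodically as the corpus grows without re-parsing the source documents.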
Background learning for programmers: Are there related efforts in the wild we should be aware of?
Lucene (and its cousins) are often used as tools in an information retrieval system. There is also auto-translation work. Are those two things relevant to this task? Which are some good papers you can point to?
This problem is usually referred to as "Maximal Frequent Sequence" in academia (I think it's also called "Maximal Frequent Itemset", but I'm not sure if the context of this one is the same). You can find papers by searching for these terms.
Those papers are from the String Processing research field, with compression and storage as the primary application areas (including Helena's paper referenced below). They do not actually intend for increased "understanding". The code snippet for extracting multiword terms I use is less competent algorithmically but is coded explicitly for a reason which is similar to what we have here. That said, I have started looking at String Processing papers explicitly to improve the working of this sort of algorithm. I stopped, deciding my time was better served coming up with cool apps, and some grad student would eventually solve the algorithmics :-). My code is based on a paper by Slava Katz in Natural Language Engineering.
As for existing projects implementing it, I haven't found anything. I also looked for plagiarism detection projects, which could implement this under a name different than "Maximal Frequent Sequence", without success either...
a greasemonkey script which, when you open a text on eur-lex, automatically highlights all fragments that have been copied from other documents. if you hover over these highlights you get a list of documents where the fragment occurs and can click through to the other doc, where of course the same sentence is again highlighted but linking back to the original? we also add a header at the beginning of the document listing all relevant source documents, with proper links to the original and the copied text.
Resources:
http://euwiki.org/Pippilongstrings
Potentially interesting papers:
- Finding All Maximal Frequent Sequences in Text ( http://www.cs.helsinki.fi/u/hahonen/ham_icml99.ps )
- Fast extraction of discontiguous sequences in text: a new approach based on maximal frequent sequences. ( http://nl.ijs.si/is-ltc06/proc/35_Doucet.pdf )
- Evaluating a summarizer for legal text with a large text collection, Frank Schilder & Hugo Molina-Salgado ( http://209.85.129.132/search?q=cache%3A8w1Cro4qyX0J%3Al2r.cs.uiuc.edu%2F~cogcomp%2Fmclc%2FfinalPapers%2FFrank.Schilder_AT_Thomson.com__SchilderMolinaSalgado.pdf+gestalt+pattern+matching&hl=en ) (this last works on the assumption that the legal text is about only one, or at most a few, different things)
