The research I have been working on since my thesis defense in November 2004 belongs to the field of natural language processing, midway between the building of efficient data-driven parsing models and the development of rich, linguistically sound data sets, that is, at the crossroads between machine learning and formal, descriptive linguistics. Since 2007, my research has been carried out within the ALpage project-team, which I formally joined in 2010, and, since 2017, within its successor, Almanach.
My research mainly revolves around the problem of robust and multilingual syntactic parsing, through, on the one hand, the parsing of languages that raise substantially harder problems than French (richer inflectional systems, freer word order, etc.) and, on the other hand, out-of-domain parsing, an area that poses major difficulties for any supervised learning model. The latter research strand is split between work on edited text (Wikipedia, biomedical articles, etc.) and user-generated content (social media, video game chat logs, etc.), which typically contains many non-canonical textual structures that break the rules of syntax at all levels of analysis (morphological, lexical, syntactic, etc.), making parsing extremely difficult.
Transversally, I've also been working a lot on going beyond simple surface syntactic structures: I want parsers' output to capture, as much as possible, a good predicate-argument structure, and for that we need a graph structure :)
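To make the need for graphs concrete: in a control construction such as "Mary wants to leave", "Mary" is an argument of both "wants" and "leave"; a single-head dependency tree cannot encode that shared argument, whereas a graph with re-entrant edges can. The toy Python sketch below is a purely hypothetical illustration (not any of the actual formats or tools discussed here) that just makes the re-entrancy explicit:

```python
from collections import defaultdict

# Toy predicate-argument graph for "Mary wants to leave"
# (hypothetical example, not an actual treebank format).
# Edges are (head, label, dependent); "Mary" receives two incoming
# edges, which a single-head dependency tree cannot represent.
edges = [
    ("wants", "subj", "Mary"),
    ("wants", "xcomp", "leave"),
    ("leave", "subj", "Mary"),   # shared argument: re-entrancy
]

incoming = defaultdict(list)
for head, label, dep in edges:
    incoming[dep].append((head, label))

# "Mary" has two governors, so the structure is a graph, not a tree.
assert len(incoming["Mary"]) == 2
print(dict(incoming))
```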
To do all this, I needed data sets: (i) to evaluate our parsers and see where we start from, and (ii) to train them so that we can obtain domain-aware models (the race toward fully robust generic models is still going on).
(to be continued)