In the regionally funded DiDi project, the goal was to build a South Tyrolean CMC corpus to document the current language use of residents and to analyse it socio-linguistically with a focus on age. The project initially focused on the German-speaking language group. However, all information regarding the project, e.g. the invitation to participate, the privacy agreement, the project web site, and the questionnaire for socio-demographic data, was published in German and Italian. Hence, we attracted speakers of both Italian and German. Accordingly, the collected data is multilingual, with major parts in German but with a substantial portion in Italian (100,000 of 600,000 tokens).

The collected multilingual CMC corpus combines Facebook status updates, comments, and private messages with socio-demographic data (e.g. language biography, internet usage habits, and general parameters like age, gender, level of education) of the writers. The data was enriched with linguistic annotations on thread, text and token level including language-specific part-of-speech (PoS) and lemma information, normalisation, and language identification.

In this paper, we describe the corpus with respect to its multilingual characteristics and give special emphasis to the Italian part of the corpus, to which we added manually corrected PoS annotations. Hence, it presents a continuation of Frey et al. (2015), which was restricted to German texts of the corpus, not taking into account the full variety of data collected for the total corpus.

2 Corpus Construction

For the purpose of the DiDi project, we collected language data from social networking sites (SNS) and combined it with socio-demographic data about the writers obtained from a questionnaire. We chose to collect data from Facebook as this SNS is well known in South Tyrol, hosts a wide variety of different communication settings, and is used over the whole territory by nearly all groups of the society.

Related research mainly draws on public data such as public Facebook groups, Twitter or chat data (e.g. Celli and Polonio (2013), Basile and Nissim (2013), Burghardt et al. (2016), Beißwenger (2013)), excluding the possibility to analyse discourse patterns of non-public everyday language use.

Collecting non-public and personal data for the DiDi corpus raised technical issues regarding Italian privacy regulations (which require user consent incl. privacy statement), the time-saving acquisition of authentic and complete language data, and the assignment of language data to questionnaire data. These issues have been solved by developing a Facebook application that allowed for the gathering of all three sorts of data (user consent, language data, questionnaire data) at once. In addition, the application was easy to share via Facebook, which helped to promote the project and to reach many potential participants. While data collection was solely managed by the Facebook application, we relied on Facebook's in-platform means (i.e. users' sharing and liking) to recruit participants. In order to reach older users (> 50 years) it was necessary to additionally resort to Facebook advertisement 3.

3 For details regarding the technical and strategical design of the data collection and methods of us (.)

Part-of-speech tagging and lemmatization: