03_processing:01_cleaning

Differences

This shows you the differences between two versions of the page.

03_processing:01_cleaning [2022/01/04 10:14] – [SMS removed] Simone Ueberwasser
03_processing:01_cleaning [2022/06/27 09:21] (current) – external edit 127.0.0.1
Line 15:
   - Duplicate SMS: For technical reasons, we received two copies of some SMS that the users had sent only once. If a pair of SMS had exactly the same text and exactly the same time stamp, one of the two was deleted. The corpus also contains SMS whose text is exactly or nearly the same as that of another SMS but whose time stamps differ. We did not consider these to be duplicates caused by technical problems and did not remove them, for two reasons. First, they were sent to us by people who participated in our campaign, and deleting them would have shown a lack of respect. It can be assumed that these identical or nearly identical SMS were also sent to the intended recipient, so they are real data and valuable to us. Second, it would have been difficult to draw a line. If a full stop is replaced by an exclamation point in the second SMS, is it still the same SMS? And if a typing error is corrected in the second SMS while the rest stays the same, is that another SMS or the same one? We did not want to decide these questions and therefore kept all the data that were sent to us.
   - SMS that were obviously not written by human beings but generated by computers: We found some automated SMS that were not sent but received by the people who submitted SMS to us. Among them are SMS that announce an incoming MMS and SMS that were created by digital agendas. In the latter, we found texts like "9am: dentist. 10am: meeting with Bob". Another type of automated SMS informed about general events, such as "tomorrow: special waste collection in your district". None of these SMS feature the type of language we are interested in, i.e. language used in SMS for communication between human beings. We thus decided to delete them.
-All other SMS were kept exactly as we received them except for [[03_processing:01_anonymization|anonymization]].
+All other SMS were kept exactly as we received them except for [[03_processing:02_anonymization|anonymization]].
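
The duplicate rule described in the list above (same text and same time stamp means a technical duplicate, anything else is kept) can be illustrated with a minimal sketch. This is not the project's actual tooling: the `Sms` record, its field names, and the function `drop_technical_duplicates` are hypothetical and serve only to make the rule concrete.

```python
from typing import Iterable, List, NamedTuple


class Sms(NamedTuple):
    """Hypothetical record; the field names are assumptions for illustration."""
    text: str
    timestamp: str  # time stamp exactly as received, compared as a string


def drop_technical_duplicates(messages: Iterable[Sms]) -> List[Sms]:
    """Keep the first SMS of every (text, timestamp) pair and drop exact repeats.

    SMS that share a text but differ in time stamp, or differ by a single
    character, are deliberately kept, as described in the list above.
    """
    seen = set()
    kept = []
    for msg in messages:
        key = (msg.text, msg.timestamp)
        if key in seen:
            continue  # exact same text and time stamp: technical duplicate
        seen.add(key)
        kept.append(msg)
    return kept


sample = [
    Sms("see you at 8.", "2010-01-05 18:00:00"),
    Sms("see you at 8.", "2010-01-05 18:00:00"),  # technical duplicate
    Sms("see you at 8.", "2010-01-05 18:02:00"),  # same text, later time stamp
    Sms("see you at 8!", "2010-01-05 18:00:00"),  # punctuation differs
]
cleaned = drop_technical_duplicates(sample)
```

Because the comparison is an exact match on both fields, the sketch never has to decide whether a corrected typo or a changed punctuation mark makes a "different" SMS, which is the question the text above chose to leave open.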
03_processing/01_cleaning.1641287672.txt.gz · Last modified: 2022/06/27 09:21 (external edit)
