Natural Language Processing for the protection of public spaces

Can LLMs help to prevent potential threats by monitoring social networks?  

Blog post prepared by VICOMTECH

Law enforcement agencies and social media companies worldwide rely on real-time monitoring of social media posts for the early detection of crimes or attacks. Social media make large sections of social life visible, including events ranging from terrorist recruitment to political protests, and police organisations are taking advantage of that. Not only is information about these events public by default, but it is also searchable and archived, making sites like Facebook, Twitter and other social networks well suited to investigations. 

However, it is technically challenging to handle this massive flow of heterogeneous, unstructured content; moreover, it is impossible to read all the social media posts generated every minute and detect malicious information manually. This is where natural language processing (NLP) can be very helpful. NLP is a subfield of artificial intelligence concerned with automatically understanding and extracting relevant information from large volumes of human-written text. Because NLP methods are fast, they allow content to be processed in real time, so law enforcement agencies (LEAs) can take urgent action to prevent risks. 

Today, the most significant advances in NLP are driven by deep-learning methods based on neural networks. The main advantage of deep learning is that the neural network can learn to understand language automatically from given examples. Previously, traditional machine learning (ML) algorithms in NLP relied on manually annotated data, which is costly and time-consuming to obtain, and the ML models represented words as single numbers, so they failed to capture the relationships between words. Over the last five to six years, the rise of large language models (LLMs) has marked a paradigm shift. 

LLMs are neural networks pre-trained on vast amounts of data using the transformer architecture. This enables the LLM to capture long-range dependencies, providing a deeper understanding of a text's overall meaning: the network learns the relationships between words and phrases and can estimate the probability of the next word. This type of language model needs a large corpus of texts and extensive computational power to learn accurate language representations and probabilities. Today, a wide variety of pre-trained models exist, differing in the size of the training corpus, its domain (general-purpose, biomedical, legal, scientific, etc.), and the number of parameters. The number of parameters is a crucial metric that reflects a model's size and potential performance: a larger number of parameters typically indicates a more sophisticated model that can capture more patterns and relationships in language. 
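To make the idea of "the probability of the next word" concrete, here is a minimal sketch using the open-source Hugging Face transformers library and the public gpt2 checkpoint, chosen purely for illustration (it is not a model used in APPRAISE):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Public GPT-2 checkpoint, used here only to illustrate next-word probabilities.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The suspect was last seen near the"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence_length, vocab_size)

# Probability distribution over the vocabulary for the word following the prompt.
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)
for p, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([int(token_id)])!r}: {p.item():.3f}")
```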

However, once trained, the model’s knowledge can be transferred to various domains and subtasks, such as text classification, machine translation, summarisation, and text generation, with only a few annotated examples. This makes NLP task-solving more precise, faster, and more manageable.  
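As an illustration of this reuse, the sketch below loads pre-trained checkpoints for three different subtasks through the Hugging Face transformers library; these are the library's public default models, not APPRAISE components:

```python
from transformers import pipeline

# Each pipeline downloads a pre-trained checkpoint and reuses its language
# knowledge for a different subtask, with no extra training performed here.
classifier = pipeline("sentiment-analysis")        # text classification
summarizer = pipeline("summarization")             # summarisation (used analogously)
translator = pipeline("translation_en_to_fr")      # machine translation

print(classifier("This neighbourhood feels safe again."))
print(translator("The meeting starts at noon."))
```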

Development within APPRAISE: hate speech and proper names 

During the APPRAISE project, we developed deep-learning modules for text classification, which can help detect suspicious behaviour in social network texts. For example, one of the modules automatically identifies hate speech (anger, hatred, or other negative emotions) in social network texts; such posts may be more likely to incite violence. This helps analysts prioritise messages and evaluate their severity. For this purpose, we collected a corpus of examples from various sources (Twitter, forums, comments) in six languages, in which every text is annotated as containing hate speech or not. We then adapted a multilingual pre-trained language model (BERT) for this task so that it can “understand” texts in different languages. 
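A hedged sketch of this kind of adaptation is shown below, using the public bert-base-multilingual-cased checkpoint with the Hugging Face transformers and datasets libraries; the toy examples stand in for the real annotated corpus, which is not public:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Binary classification head on top of multilingual BERT: 0 = not hate speech, 1 = hate speech.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Toy stand-ins for the annotated multilingual corpus.
train = Dataset.from_dict({
    "text": ["They should all be wiped out", "Lovely concert in the park today"],
    "label": [1, 0],
})
train = train.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mbert-hate-speech",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train,
)
trainer.train()
```

Because the underlying checkpoint was pre-trained on many languages, the fine-tuned classifier can generalise across the six languages of the corpus rather than requiring one model per language.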

When collecting data, it is essential to focus on specific places, hashtags, or keywords associated with crime. This strategy narrows the scope of the acquired information and adds context. For this purpose, we developed an NLP module that identifies the names of persons, organisations, and locations mentioned in texts in various languages and classifies them into predefined labels. Detecting names may seem straightforward: simply search the text for the name you need. However, names are ambiguous; for instance, if we find the name “Barcelona”, is it a city (location) or a football team (organisation)? Here, the LLM learns the context in which the name occurs and uses it to disambiguate. The model can also handle name variations, such as “USA”, “U.S.A.”, and “United States of America”. 
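The sketch below illustrates this context-dependent disambiguation with the publicly available dslim/bert-base-NER model (an English model chosen for illustration; the APPRAISE module itself is multilingual):

```python
from transformers import pipeline

# Public English NER model; aggregation_strategy="simple" merges sub-word
# tokens back into whole entity mentions.
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

# Same surface form, different entity type depending on context.
print(ner("Barcelona beat Real Madrid 3-1 last night."))     # Barcelona -> ORG
print(ner("Thousands marched through Barcelona on Sunday.")) # Barcelona -> LOC
```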

Thanks to a mid-size LLM that offers a good balance of accuracy and speed, the NLP modules within the APPRAISE project can process thousands of social media posts per second, facilitating the detection of suspicious behaviour. 
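Throughput at that scale comes largely from batching posts through the model on a GPU. A minimal sketch of the pattern, with a public sentiment checkpoint standing in for the APPRAISE classifier, could look like this:

```python
from transformers import pipeline

# Public checkpoint used as a stand-in for the project's own classifier.
# device=0 assumes the first GPU is available; omit it to run on CPU.
clf = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english",
               device=0)

posts = ["incoming post 1", "incoming post 2"]  # in practice, a continuous stream
# Batching many posts per forward pass is what makes high throughput feasible.
for result in clf(posts, batch_size=64, truncation=True):
    print(result)  # e.g. {'label': 'NEGATIVE', 'score': 0.98}
```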

 

References  

Chiang, C.-H., Chuang, Y.-S., & Lee, H. (2022). Recent advances in pre-trained language models: Why do they work and how do they work. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: Tutorial Abstracts (pp. 8–15). Taipei: Association for Computational Linguistics. 

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186). Minneapolis, MN: Association for Computational Linguistics. 

Elezaj, O., Yayilgan, S. Y., Kalemi, E., Wendelberg, L., Abomhara, M., & Ahmed, J. (2020). Towards designing a knowledge graph-based framework for investigating and preventing crime on online social networks. In E-Democracy–Safeguarding Democracy and Human Rights in the Digital Age: 8th International Conference, e-Democracy 2019, Athens, Greece, December 12-13, 2019, Proceedings 8 (pp. 181-195). Springer International Publishing. 

Huang, L., Liu, G., Chen, T., Yuan, H., Shi, P., & Miao, Y. (2021). Similarity-based emergency event detection in social media. Journal of Safety Science and Resilience, 2(1), 11–19. 

Merayo-Alba, S., Fidalgo, E., González-Castro, V., Alaiz-Rodríguez, R., & Velasco-Mata, J. (2019). Use of natural language processing to identify inappropriate content in text. In Hybrid Artificial Intelligent Systems: 14th International Conference, HAIS 2019, León, Spain, September 4–6, 2019, Proceedings 14 (pp. 254-263). Springer International Publishing. 

Weimann, G. (2014). New terrorism and new media. Washington, DC: Commons Lab of the Woodrow Wilson International Center for Scholars.