Publications
Publications by category in reverse chronological order. Generated by jekyll-scholar.
2024
- Measuring the Robustness of NLP Models to Domain Shifts. Nitay Calderon*, Naveh Porat*, Eyal Ben-David, Alexander Chapanin, Zorik Gekhman, Nadav Oved, Vitaly Shalumov, and Roi Reichart. In Findings of the Association for Computational Linguistics: EMNLP 2024, Nov 2024.
Existing research on Domain Robustness (DR) suffers from disparate setups, limited task variety, and scarce research on recent capabilities such as in-context learning. Furthermore, the common practice of measuring DR might not be fully accurate. Current research focuses on challenge sets and relies solely on the Source Drop (SD): Using the source in-domain performance as a reference point for degradation. However, we argue that the Target Drop (TD), which measures degradation from the target in-domain performance, should be used as a complementary point of view. To address these issues, we first curated a DR benchmark comprised of 7 diverse NLP tasks, which enabled us to measure both the SD and the TD. We then conducted a comprehensive large-scale DR study involving over 14,000 domain shifts across 21 fine-tuned models and few-shot LLMs. We found that both model types suffer from drops upon domain shifts. While fine-tuned models excel in-domain, few-shot LLMs often surpass them cross-domain, showing better robustness. In addition, we found that a large SD can often be explained by shifting to a harder domain rather than by a genuine DR challenge, and this highlights the importance of TD as a complementary metric. We hope our study will shed light on the current DR state of NLP models and promote improved evaluation practices toward more robust models.
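To make the two drop metrics concrete, here is a minimal sketch of how a Source Drop and a Target Drop could be computed for a single source-to-target shift; the relative-drop formulation and function names are illustrative assumptions, not the paper's exact definitions.

```python
# Illustrative sketch (not the paper's exact formulation) of the two drop metrics.
# perf(A -> B) denotes the performance of a model trained on domain A and evaluated on domain B.

def source_drop(perf_source_on_source: float, perf_source_on_target: float) -> float:
    """Degradation relative to the source in-domain performance."""
    return (perf_source_on_source - perf_source_on_target) / perf_source_on_source

def target_drop(perf_target_on_target: float, perf_source_on_target: float) -> float:
    """Degradation relative to the target in-domain performance."""
    return (perf_target_on_target - perf_source_on_target) / perf_target_on_target

# Example: a shift to a genuinely harder domain can inflate SD while TD stays small.
sd = source_drop(perf_source_on_source=0.90, perf_source_on_target=0.75)  # ~0.167
td = target_drop(perf_target_on_target=0.78, perf_source_on_target=0.75)  # ~0.038
print(f"SD={sd:.3f}, TD={td:.3f}")
```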
- The Colorful Future of LLMs: Evaluating and Improving LLMs as Emotional Supporters for Queer Youth. Shir Lissak*, Nitay Calderon*, Geva Shenkman, Yaakov Ophir, Eyal Fruchter, Anat Brunstein Klomek, and Roi Reichart. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Jun 2024.
Queer youth face increased mental health risks, such as depression, anxiety, and suicidal ideation. Hindered by negative stigma, they often avoid seeking help and rely on online resources, which may provide incompatible information. Although access to a supportive environment and reliable information is invaluable, many queer youth worldwide have no access to such support. However, this could soon change due to the rapid adoption of Large Language Models (LLMs) such as ChatGPT. This paper aims to comprehensively explore the potential of LLMs to revolutionize emotional support for queers. To this end, we conduct a qualitative and quantitative analysis of LLM’s interactions with queer-related content. To evaluate response quality, we develop a novel ten-question scale that is inspired by psychological standards and expert input. We apply this scale to score several LLMs and human comments to posts where queer youth seek advice and share experiences. We find that LLM responses are supportive and inclusive, outscoring humans. However, they tend to be generic, not empathetic enough, and lack personalization, resulting in unreliable and potentially harmful advice. We discuss these challenges, demonstrate that a dedicated prompt can improve the performance, and propose a blueprint of an LLM-supporter that actively (but sensitively) seeks user context to provide personalized, empathetic, and reliable responses. Our annotated dataset is available for further research: https://github.com/nitaytech/LGBTeenDataset
- On Behalf of the Stakeholders: Trends in NLP Model Interpretability in the Era of LLMs. Nitay Calderon and Roi Reichart. CoRR, Jun 2024.
Recent advancements in NLP systems, particularly with the introduction of LLMs, have led to widespread adoption of these systems by a broad spectrum of users across various domains, impacting decision-making, the job market, society, and scientific research. This surge in usage has led to an explosion in NLP model interpretability and analysis research, accompanied by numerous technical surveys. Yet, these surveys often overlook the needs and perspectives of explanation stakeholders. In this paper, we address three fundamental questions: Why do we need interpretability, what are we interpreting, and how? By exploring these questions, we examine existing interpretability paradigms, their properties, and their relevance to different stakeholders. We further explore the practical implications of these paradigms by analyzing trends from the past decade across multiple research fields. To this end, we retrieved thousands of papers and employed an LLM to characterize them. Our analysis reveals significant disparities between NLP developers and non-developer users, as well as between research fields, underscoring the diverse needs of stakeholders. For example, explanations of internal model components are rarely used outside the NLP field. We hope this paper informs the future design, development, and application of methods that align with the objectives and requirements of various stakeholders.
- Faithful Explanations of Black-box NLP Models Using LLM-generated Counterfactuals. Yair Ori Gat*, Nitay Calderon*, Amir Feder, Alexander Chapanin, Amit Sharma, and Roi Reichart. In The Twelfth International Conference on Learning Representations (ICLR 2024), Vienna, Austria, May 7-11, 2024.
Causal explanations of the predictions of NLP systems are essential to ensure safety and establish trust. Yet, existing methods often fall short of explaining model predictions effectively or efficiently and are often model-specific. In this paper, we address model-agnostic explanations, proposing two approaches for counterfactual (CF) approximation. The first approach is CF generation, where a large language model (LLM) is prompted to change a specific text concept while keeping confounding concepts unchanged. While this approach is demonstrated to be very effective, applying LLM at inference-time is costly. We hence present a second approach based on matching, and propose a method that is guided by an LLM at training-time and learns a dedicated embedding space. This space is faithful to a given causal graph and effectively serves to identify matches that approximate CFs. After showing theoretically that approximating CFs is required in order to construct faithful explanations, we benchmark our approaches and explain several models, including LLMs with billions of parameters. Our empirical results demonstrate the excellent performance of CF generation models as model-agnostic explainers. Moreover, our matching approach, which requires far less test-time resources, also provides effective explanations, surpassing many baselines. We also find that Top-K techniques universally improve every tested method. Finally, we showcase the potential of LLMs in constructing new benchmarks for model explanation and subsequently validate our conclusions. Our work illuminates new pathways for efficient and accurate approaches to interpreting NLP systems.
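A minimal sketch of the CF-generation idea described above: prompting an LLM to intervene on a single text concept while holding confounding concepts fixed. The prompt wording and the `call_llm` placeholder are illustrative assumptions, not the paper's exact setup.

```python
# Illustrative sketch of LLM-based counterfactual (CF) generation: ask an LLM to
# change one text concept while keeping confounding concepts unchanged.
# `call_llm` is a placeholder for any chat/completion API.
from typing import Callable

def make_cf_prompt(text: str, concept: str, new_value: str, confounders: list[str]) -> str:
    return (
        f"Rewrite the following text so that its '{concept}' becomes '{new_value}'.\n"
        f"Keep these other concepts unchanged: {', '.join(confounders)}.\n"
        f"Change as little as possible.\n\nText: {text}\n\nRewritten text:"
    )

def generate_counterfactual(text: str, concept: str, new_value: str,
                            confounders: list[str],
                            call_llm: Callable[[str], str]) -> str:
    return call_llm(make_cf_prompt(text, concept, new_value, confounders))

# The original and the CF can then both be fed to the explained classifier;
# the difference in its predictions estimates the effect of the concept.
```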
- Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance. Omer Nahum, Nitay Calderon, Orgad Keller, Idan Szpektor, and Roi Reichart. arXiv preprint arXiv:2410.18889, 2024.
NLP benchmarks rely on standardized datasets for training and evaluating models and are crucial for advancing the field. Traditionally, expert annotations ensure high-quality labels; however, the cost of expert annotation does not scale well with the growing demand for larger datasets required by modern models. While crowd-sourcing provides a more scalable solution, it often comes at the expense of annotation precision and consistency. Recent advancements in large language models (LLMs) offer new opportunities to enhance the annotation process, particularly for detecting label errors in existing datasets. In this work, we consider the recent approach of LLM-as-a-judge, leveraging an ensemble of LLMs to flag potentially mislabeled examples. Through a case study of four datasets from the TRUE benchmark, covering different tasks and domains, we empirically analyze the labeling quality of existing datasets, and compare expert, crowd-sourced, and our LLM-based annotations in terms of agreement, label quality, and efficiency, demonstrating the strengths and limitations of each annotation method. Our findings reveal a substantial number of label errors, which, when corrected, induce a significant upward shift in reported model performance. This suggests that many of the LLMs' so-called mistakes are due to label errors rather than genuine model failures. Additionally, we discuss the implications of mislabeled data and propose methods to mitigate them in training to improve model performance.
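A minimal sketch of the ensemble-of-judges idea described above: several LLM judges re-annotate an example, and it is flagged as a potential label error when they confidently agree on a different label. The judge interface and the agreement threshold are assumptions for illustration, not the paper's exact procedure.

```python
# Illustrative sketch: flag an example as potentially mislabeled when an
# ensemble of LLM judges agrees on a label that differs from the dataset label.
from collections import Counter
from typing import Callable, Sequence

def flag_label_error(example: str, dataset_label: str,
                     judges: Sequence[Callable[[str], str]],
                     min_agreement: float = 0.8) -> bool:
    votes = Counter(judge(example) for judge in judges)   # each judge returns a label
    majority_label, count = votes.most_common(1)[0]
    agreement = count / len(judges)
    return majority_label != dataset_label and agreement >= min_agreement
```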
- NL-Eye: Abductive NLI for Images. Mor Ventura, Michael Toker, Nitay Calderon, Zorik Gekhman, Yonatan Bitton, and Roi Reichart. arXiv preprint arXiv:2410.02613, 2024.
Will a Visual Language Model (VLM)-based bot warn us about slipping if it detects a wet floor? Recent VLMs have demonstrated impressive capabilities, yet their ability to infer outcomes and causes remains underexplored. To address this, we introduce NL-Eye, a benchmark designed to assess VLMs’ visual abductive reasoning skills. NL-Eye adapts the abductive Natural Language Inference (NLI) task to the visual domain, requiring models to evaluate the plausibility of hypothesis images based on a premise image and explain their decisions. NL-Eye consists of 350 carefully curated triplet examples (1,050 images) spanning diverse reasoning categories: physical, functional, logical, emotional, cultural, and social. The data curation process involved two steps - writing textual descriptions and generating images using text-to-image models, both requiring substantial human involvement to ensure high-quality and challenging scenes. Our experiments show that VLMs struggle significantly on NL-Eye, often performing at random baseline levels, while humans excel in both plausibility prediction and explanation quality. This demonstrates a deficiency in the abductive reasoning capabilities of modern VLMs. NL-Eye represents a crucial step toward developing VLMs capable of robust multimodal reasoning for real-world applications, including accident-prevention bots and generated video verification.
2023
- A Systematic Study of Knowledge Distillation for Natural Language Generation with Pseudo-Target Training. Nitay Calderon, Subhabrata Mukherjee, Roi Reichart, and Amir Kantor. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jul 2023.
Modern Natural Language Generation (NLG) models come with massive computational and storage requirements. In this work, we study the potential of compressing them, which is crucial for real-world applications serving millions of users. We focus on Knowledge Distillation (KD) techniques, in which a small student model learns to imitate a large teacher model, allowing knowledge to be transferred from the teacher to the student. In contrast to much of the previous work, our goal is to optimize the model for a specific NLG task and a specific dataset. Typically in real-world applications, in addition to labeled data there is abundant unlabeled task-specific data, which is crucial for attaining high compression rates via KD. In this work, we conduct a systematic study of task-specific KD techniques for various NLG tasks under realistic assumptions. We discuss the special characteristics of NLG distillation and particularly the exposure bias problem. We then derive a family of Pseudo-Target (PT) augmentation methods, substantially extending prior work on sequence-level KD. We propose the Joint-Teaching method, which applies word-level KD to multiple PTs generated by both the teacher and the student. Finally, we validate our findings in an extreme setup with no labeled examples using GPT-4 as the teacher. Our study provides practical model design observations and demonstrates the effectiveness of PT training for task-specific KD in NLG.
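For readers unfamiliar with word-level KD, here is a minimal sketch of the per-token distillation loss applied to a pseudo-target sequence, in the spirit of the Joint-Teaching idea above; tensor shapes, the temperature, and the exact loss weighting are assumptions, not the paper's precise recipe.

```python
# Illustrative sketch of word-level KD on a pseudo-target (PT) sequence:
# the student matches the teacher's per-token distribution over a PT generated
# by either the teacher or the student.
import torch
import torch.nn.functional as F

def word_level_kd_loss(student_logits: torch.Tensor,   # (batch, seq_len, vocab)
                       teacher_logits: torch.Tensor,   # (batch, seq_len, vocab)
                       temperature: float = 1.0) -> torch.Tensor:
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # KL(teacher || student), summed over tokens and vocabulary, averaged over the batch.
    return F.kl_div(s_log_probs, t_probs, reduction="batchmean") * temperature ** 2
```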
- A Picture May Be Worth a Thousand Lives: An Interpretable Artificial Intelligence Strategy for Predictions of Suicide Risk from Social Media Images. Yael Badian, Yaakov Ophir, Refael Tikochinski, Nitay Calderon, Anat Brunstein Klomek, and Roi Reichart. CoRR, Jul 2023.
The promising research on Artificial Intelligence usage in suicide prevention has principal gaps, including black box methodologies, inadequate outcome measures, and scarce research on non-verbal inputs, such as social media images (despite their popularity today, in our digital era). This study addresses these gaps and combines theory-driven and bottom-up strategies to construct a hybrid and interpretable prediction model of valid suicide risk from images. The lead hypothesis was that images contain valuable information about emotions and interpersonal relationships, two central concepts in suicide-related treatments and theories. The dataset included 177,220 images by 841 Facebook users who completed a gold-standard suicide scale. The images were represented with CLIP, a state-of-the-art algorithm, which was utilized, unconventionally, to extract predefined features that served as inputs to a simple logistic-regression prediction model (in contrast to complex neural networks). The features addressed basic and theory-driven visual elements using everyday language (e.g., bright photo, photo of sad people). The results of the hybrid model (that integrated theory-driven and bottom-up methods) indicated high prediction performance that surpassed common bottom-up algorithms, thus providing a first proof that images (alone) can be leveraged to predict validated suicide risk. Corresponding with the lead hypothesis, at-risk users had images with increased negative emotions and decreased belongingness. The results are discussed in the context of non-verbal warning signs of suicide. Notably, the study illustrates the advantages of hybrid models in such complicated tasks and provides simple and flexible prediction strategies that could be utilized to develop real-life monitoring tools for suicide.
- Social media images can predict suicide risk using interpretable large language-vision models. Yael Badian, Yaakov Ophir, Refael Tikochinski, Nitay Calderon, Anat Brunstein Klomek, Eyal Fruchter, and Roi Reichart. The Journal of Clinical Psychiatry, Jul 2023.
Background: Suicide, a leading cause of death and a major public health concern, has become an even more pressing matter since the emergence of social media two decades ago and, more recently, following the hardships that characterized the COVID-19 crisis. Contemporary studies therefore aim to predict signs of suicide risk from social media using highly advanced artificial intelligence (AI) methods. Indeed, these new AI-based studies managed to break a longstanding prediction ceiling in suicidology; however, they still have principal limitations that prevent their implementation in real-life settings. These include “black box” methodologies, inadequate outcome measures, and scarce research on non-verbal inputs, such as images (despite their popularity today).
Objective: This study aims to address these limitations and present an interpretable prediction model of clinically valid suicide risk from images.
Methods: The data were extracted from a larger dataset from May through June 2018 that was used to predict suicide risk from textual postings. Specifically, the extracted data included a total of 177,220 images that were uploaded by 841 Facebook users who completed a gold-standard suicide scale. The images were represented with CLIP (Contrastive Language-Image Pre-training), a state-of-the-art deep-learning algorithm, which was utilized, unconventionally, to extract predefined interpretable features (eg, “photo of sad people”) that served as inputs to a simple logistic regression model.
Results: The results of this hybrid model that integrated theory-driven features with bottom-up methods indicated high prediction performance that surpassed common deep learning algorithms (area under the receiver operating characteristic curve [AUC] = 0.720, Cohen d = 0.82). Further analyses supported a theory-driven hypothesis that at-risk users would have images with increased negative emotions and decreased belongingness.
Conclusions: This study provides a first proof that publicly available images can be leveraged to predict validated suicide risk. It also provides simple and flexible strategies that could enhance the development of real-life monitoring tools for suicide.
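A minimal sketch of the pipeline described in the two entries above: CLIP similarity scores between a user's images and a set of predefined, human-readable visual concepts serve as interpretable features for a simple logistic-regression classifier. The concept list, the per-user aggregation, and the specific CLIP checkpoint are assumptions for illustration, not the studies' exact configuration.

```python
# Illustrative sketch: interpretable CLIP-based features feeding a logistic regression.
import numpy as np
from PIL import Image
from sklearn.linear_model import LogisticRegression
from transformers import CLIPModel, CLIPProcessor

CONCEPTS = ["a bright photo", "a photo of sad people", "a photo of people together"]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def user_features(image_paths: list[str]) -> np.ndarray:
    """Image-concept similarity scores, averaged over one user's images."""
    images = [Image.open(p) for p in image_paths]
    inputs = processor(text=CONCEPTS, images=images, return_tensors="pt", padding=True)
    sims = model(**inputs).logits_per_image.detach().numpy()  # (n_images, n_concepts)
    return sims.mean(axis=0)  # one interpretable feature vector per user

# X: one aggregated feature vector per user; y: binary at-risk label from the suicide scale.
# clf = LogisticRegression().fit(X, y)  # coefficients remain interpretable per concept
```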
2022
- DoCoGen: Domain Counterfactual Generation for Low Resource Domain Adaptation. Nitay Calderon*, Eyal Ben-David*, Amir Feder, and Roi Reichart. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), May 2022.
Natural language processing (NLP) algorithms have become very successful, but they still struggle when applied to out-of-distribution examples. In this paper, we propose a controllable generation approach to address this domain adaptation (DA) challenge. Given an input text example, our DoCoGen algorithm generates a domain-counterfactual textual example (D-con): a text that is similar to the original in all aspects, including the task label, but whose domain is changed to a desired one. Importantly, DoCoGen is trained using only unlabeled examples from multiple domains: no NLP task labels or parallel pairs of textual examples and their domain-counterfactuals are required. We show that DoCoGen can generate coherent counterfactuals consisting of multiple sentences. We use the D-cons generated by DoCoGen to augment a sentiment classifier and a multi-label intent classifier in 20 and 78 DA setups, respectively, where source-domain labeled data is scarce. Our model outperforms strong baselines and improves the accuracy of a state-of-the-art unsupervised DA algorithm.
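A minimal sketch of the augmentation step described above: each labeled source-domain example is paired with D-cons that keep the task label, and both are used to train the task classifier. The `generate_dcon` placeholder stands in for the DoCoGen generator and is an assumption for illustration.

```python
# Illustrative sketch: augmenting scarce labeled data with label-preserving
# domain counterfactuals (D-cons).
from typing import Callable, Iterable

def augment_with_dcons(labeled_data: Iterable[tuple[str, str]],
                       target_domains: list[str],
                       generate_dcon: Callable[[str, str], str]) -> list[tuple[str, str]]:
    augmented = []
    for text, label in labeled_data:
        augmented.append((text, label))
        for domain in target_domains:
            # The D-con changes the domain of the text but preserves the task label.
            augmented.append((generate_dcon(text, domain), label))
    return augmented
```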
- A Functional Information Perspective on Model Interpretation. Itai Gat, Nitay Calderon, Roi Reichart, and Tamir Hazan. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA.
Contemporary predictive models are hard to interpret as their deep nets exploit numerous complex relations between input elements. This work suggests a theoretical framework for model interpretability by measuring the contribution of relevant features to the functional entropy of the network with respect to the input. We rely on the log-Sobolev inequality that bounds the functional entropy by the functional Fisher information with respect to the covariance of the data. This provides a principled way to measure the amount of information contribution of a subset of features to the decision function. Through extensive experiments, we show that our method surpasses existing interpretability sampling-based methods on various data signals such as image, text, and audio.
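For reference, the textbook Gaussian log-Sobolev inequality bounds the functional entropy by the functional Fisher information; the paper works with a covariance-weighted, data-dependent variant, so the form below is only the standard version the abstract builds on.

```latex
% Standard log-Sobolev inequality for the standard Gaussian measure \gamma on R^d
% and a positive function f (textbook form; the paper's variant is weighted by the
% data covariance):
\mathrm{Ent}_{\gamma}(f)
  \;=\; \mathbb{E}_{\gamma}\!\left[f \log f\right]
        - \mathbb{E}_{\gamma}[f]\,\log \mathbb{E}_{\gamma}[f]
  \;\le\; \tfrac{1}{2}\,
          \mathbb{E}_{\gamma}\!\left[\frac{\lVert \nabla f \rVert^{2}}{f}\right]
  \;=\; \tfrac{1}{2}\,\mathcal{I}_{\gamma}(f).
```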
2021
- From Limited Annotated Raw Material Data to Quality Production Data: A Case Study in the Milk Industry. Roee Shraga, Gil Katz, Yael Badian, Nitay Calderon, and Avigdor Gal. In CIKM '21: The 30th ACM International Conference on Information and Knowledge Management, Virtual Event, Queensland, Australia, November 1-5, 2021.
Industry 4.0 offers opportunities to combine multiple sensor data sources using IoT technologies for better utilization of raw material in production lines. A common belief that data is readily available (the big data phenomenon) is oftentimes challenged by the need to effectively acquire quality data under severe constraints. In this paper we propose a design methodology, using active learning to enhance learning capabilities, for building a model of production outcome using a constrained amount of raw material training data. The proposed methodology extends existing active learning methods to effectively solve regression-based learning problems and may serve settings where data acquisition requires excessive resources in the physical world. We further suggest a set of qualitative measures to analyze learners' performance. The proposed methodology is demonstrated using an actual application in the milk industry, where milk is gathered from multiple small milk farms and brought to a dairy production plant to be processed into cottage cheese.
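A minimal sketch of an uncertainty-driven active-learning step for regression, in the spirit of the methodology described above; the committee construction and the disagreement-based acquisition score are assumptions for illustration and may differ from the paper's criteria.

```python
# Illustrative sketch: pick the unlabeled candidates whose predictions a small
# committee of regressors disagrees on most, and send them for costly labeling.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def select_next_samples(X_labeled, y_labeled, X_candidates, n_select=5, n_models=5):
    # Train a committee of regressors on bootstrap resamples of the labeled pool.
    rng = np.random.default_rng(0)
    committee = []
    for _ in range(n_models):
        idx = rng.choice(len(X_labeled), size=len(X_labeled), replace=True)
        committee.append(RandomForestRegressor(n_estimators=50).fit(X_labeled[idx], y_labeled[idx]))
    # Disagreement (variance across committee predictions) as the acquisition score.
    preds = np.stack([m.predict(X_candidates) for m in committee])  # (n_models, n_candidates)
    scores = preds.var(axis=0)
    return np.argsort(scores)[-n_select:]  # indices of the most informative candidates
```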