We propose end-to-end multimodal fact-checking and explanation generation, a task in which the input is a claim and a large collection of web sources, including articles, images, videos, and tweets. The goal is to assess the truthfulness of the claim by retrieving relevant evidence and predicting a truthfulness label (i.e., support, refute, or not enough information), and to generate a rationalization statement that explains the reasoning and ruling process. To support this research, we construct MOCHEG, a large-scale dataset of 21,184 claims, each annotated with a truthfulness label and a ruling statement, together with 43,148 pieces of text evidence and 15,375 pieces of image evidence. To establish baseline performance on MOCHEG, we experiment with several state-of-the-art neural architectures on three pipelined subtasks: multimodal evidence retrieval, claim verification, and explanation generation. The results show that state-of-the-art end-to-end multimodal fact-checking does not yet achieve satisfactory performance. To the best of our knowledge, we are the first to build a benchmark dataset and solutions for end-to-end multimodal fact-checking and explanation generation.
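
As an illustration of the task interface only, the three pipelined subtasks can be sketched as three functions chained together. This is a minimal sketch under stated assumptions and not the neural architectures evaluated on MOCHEG: the names `Evidence`, `Verdict`, `retrieve`, `verify`, `explain`, and `fact_check` are hypothetical, token overlap stands in for a learned multimodal retriever, a keyword heuristic stands in for a trained claim-verification model, and a template stands in for a text generator. Image evidence is represented here by a caption string, which is also an assumption.

```python
from dataclasses import dataclass
from typing import List

# The three truthfulness labels named in the task definition.
LABELS = ("support", "refute", "not enough information")


@dataclass
class Evidence:
    modality: str  # "text" or "image" (hypothetical field)
    content: str   # passage text, or a caption standing in for an image


@dataclass
class Verdict:
    label: str
    evidence: List[Evidence]
    explanation: str


def _tokens(text: str) -> set:
    """Lowercased whitespace tokens; a crude stand-in for learned embeddings."""
    return set(text.lower().split())


def retrieve(claim: str, corpus: List[Evidence], k: int = 2) -> List[Evidence]:
    """Subtask 1: rank evidence by token overlap with the claim (assumption)."""
    claim_tokens = _tokens(claim)
    ranked = sorted(
        corpus,
        key=lambda e: len(claim_tokens & _tokens(e.content)),
        reverse=True,
    )
    # Keep at most k items, dropping anything with no overlap at all.
    return [e for e in ranked[:k] if claim_tokens & _tokens(e.content)]


def verify(claim: str, evidence: List[Evidence]) -> str:
    """Subtask 2: toy keyword heuristic in place of a trained classifier."""
    if not evidence:
        return "not enough information"
    negations = {"not", "no", "never", "false"}
    if any(_tokens(e.content) & negations for e in evidence):
        return "refute"
    return "support"


def explain(claim: str, label: str, evidence: List[Evidence]) -> str:
    """Subtask 3: template text in place of a generated ruling statement."""
    return f"Claim '{claim}' is labeled '{label}' based on {len(evidence)} evidence item(s)."


def fact_check(claim: str, corpus: List[Evidence]) -> Verdict:
    """Chain retrieval, verification, and explanation into one pipeline."""
    evidence = retrieve(claim, corpus)
    label = verify(claim, evidence)
    return Verdict(label, evidence, explain(claim, label, evidence))
```

Because each stage consumes the output of the previous one, errors in retrieval propagate into verification and explanation, which is one reason end-to-end performance can lag behind the performance of each subtask measured in isolation.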