Chapter 4 Antispam technology
4.1 Introduction to some antispam software (system)
(1) SpamAssassin
SpamAssassin is a well-known e-mail spam filtering program, which is released under the Apache License 2.0 and is now part of the Apache Software Foundation.
SpamAssassin has become a benchmark among anti-spam systems. It combines many spam-detection techniques, including DNS-based and fuzzy-checksum-based detection, Bayesian filtering, external programs, blacklists and online databases. It identifies spam by running a variety of tests on message headers and content and combining the results with statistical methods.

SpamAssassin can be integrated with a mail server to filter all mail for a site automatically, protecting every user on that server. Individual users can also run it on their own mailbox or computer for personal protection.

The program is highly configurable; even when it is used as a system-wide filter, it can still be configured to support per-user preferences.
SpamAssassin is easy to manage, and users do not have to constantly update their email accounts, address books, and so on. Personalized information about users and sites can be applied after the email has been classified and authenticated by SpamAssassin, without affecting the result.

These features give SpamAssassin an advantage over other anti-spam tools. As a result, it is widely used in mail management. It runs on many operating systems, mail servers and clients. Its users include mail service providers, businesses, non-profit and educational organizations, and end users. SpamAssassin is also used in many commercial antispam systems on the market.

(2) Barracuda
Barracuda Spam Firewall is a solution that integrates software and hardware and provides enterprises with the following protection: anti-spam, anti-virus, anti-phishing, anti-spyware, and rejection of attacks on the server.

Barracuda adopts ten layers of protection:
Denial-of-service attack protection;
IP block list;
Rate control;
First layer of virus filtering;
Second layer of virus filtering;
User-defined rules;
Spam fingerprint filtering;
Spam intention analysis;
Bayesian database analysis;
Rule-based scoring system.

The ten-layer filtering architecture of Barracuda optimizes the processing of each email. The appliance handles more than a million emails a day and, compared with software-only solutions, can greatly reduce the virus and spam load on the mail server.
In addition, Barracuda includes attachment scanning, virus filtering, rate control and other technologies to ensure that incoming mail is legitimate.
Barracuda is simple to install, usually taking only about 5 minutes, because there is no need to install software or make any changes to the existing mail system.

Once installation is complete, the system administrator uses an easy-to-learn Web management interface for monitoring and maintenance. Product maintenance is also simple: the Barracuda operations center updates the products automatically. At this center, engineers monitor the Internet around the clock for the latest viruses and spam messages, trying to block as much spam as possible without increasing costs.
The Barracuda series suits enterprises and institutions of all sizes, from small to large, and is affordable. The product does not charge per user, which makes it a cost-effective enterprise-class spam and virus protection solution.

4.2 Whitelists and blacklists
Any message from a sender on a whitelist is considered legitimate. If a spam filter keeps a whitelist, mail from the listed domains, email addresses, or IP addresses is always allowed.
A whitelist is much smaller than a blacklist, so it is easy to maintain. Mail matched by a whitelist can bypass the spam filters, which reduces the load on those filters. However, a whitelist alone has shortcomings: to block a high percentage of spam, the accompanying email filters must still be updated constantly, because spammers create new sender addresses and use new keywords that allow their mail to slip through.

Blacklists are also known as real-time blackhole lists (RBL) or domain name system blacklists (DNSBL). Spammers may bypass blacklists by using zombie networks: because a zombie network includes many different computers, possibly from many different domains, blacklisting a specific domain provides only limited spam protection.
Blacklists are usually easy to implement and have low CPU overhead, since they only require DNS lookups. They also allow spam to be blocked at the SMTP connection stage, which prevents it from entering the network.
One drawback is that blacklists are maintained by external entities and could be withdrawn at any time without warning. The people who manage a blacklist also determine its effectiveness: if the list is not updated in a timely manner, spam may get through.
Another problem is that as the amount of spam increases, so does the number of DNS lookups needed to check blacklists. This is especially costly for mail servers that consult more than one blacklist.
A study conducted at the MIT Computer Science and Artificial
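
The DNS lookup behind a blacklist check can be sketched as follows. This is a minimal illustration, not the method of any particular list: the zone name `dnsbl.example.org` is a placeholder, only IPv4 is handled, and the `resolve` parameter is a hypothetical hook added here so the logic can be tried without network access. The convention assumed is the usual one for DNSBLs: reverse the octets of the connecting IP, append the list's zone, and treat any answer as "listed".

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to query: reversed IPv4 octets followed by the list zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str, resolve=socket.gethostbyname) -> bool:
    """Return True if the blacklist zone returns an address for this IP.

    `resolve` is injectable (an assumption of this sketch) so the function
    can be exercised with a fake resolver instead of real DNS.
    """
    try:
        resolve(dnsbl_query_name(ip, zone))
        return True           # any address record is treated as "listed"
    except OSError:           # socket.gaierror is a subclass of OSError
        return False          # no record (or lookup failure): treated as "not listed"
```

Because the check is a single name lookup, it can run while the SMTP connection is being set up, before any message content is transferred.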

4.3 Challenge/Response technology
Messages from addresses not on the whitelist trigger an automatic challenge to the sender, requiring them to prove that they are a real user rather than an automated one. For example, the sender may be required to click on a link in the reply message and enter a valid email address and the ID number of the response message. If this process is completed, the email passes through the challenge/response system (Pfleeger and Bloom 2005).
This system also deters spammers who send email manually, because the time required to answer the challenge could be better spent sending spam to additional addresses.

The drawback of challenge/response systems is the deadlock problem. If two parties who have never communicated before both run challenge/response systems, the challenge sent by the recipient’s system will be caught by the sender’s challenge/response system, and neither party will have the opportunity to provide a suitable response. If the original sender adds the recipient’s address to their whitelist before communicating, the deadlock can be avoided.
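
The deadlock and its whitelist-based remedy can be illustrated with a toy model. The class and method names below are hypothetical and the challenge itself is reduced to a return value; this is only a sketch of the control flow described above, not of a real mail system.

```python
class ChallengeResponseMailbox:
    """Toy model: mail from unknown senders is held until a challenge is answered."""

    def __init__(self, owner, whitelist=()):
        self.owner = owner
        self.whitelist = set(whitelist)
        self.inbox = []   # delivered messages
        self.held = []    # messages waiting for the sender to answer a challenge

    def receive(self, sender, body):
        """Return 'delivered' for whitelisted senders, otherwise hold and challenge."""
        if sender in self.whitelist:
            self.inbox.append((sender, body))
            return "delivered"
        self.held.append((sender, body))
        return "challenged"

    def confirm(self, sender):
        """The sender answered the challenge: whitelist them and release held mail."""
        self.whitelist.add(sender)
        released = [m for m in self.held if m[0] == sender]
        self.held = [m for m in self.held if m[0] != sender]
        self.inbox.extend(released)
        return len(released)
```

If both `alice` and `bob` run such a mailbox with empty whitelists, Bob's challenge to Alice is itself answered with a challenge and never read. Creating Alice's mailbox with `whitelist={"bob"}` before she writes lets Bob's challenge through.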

4.4 Open Relay and open proxy
As mentioned earlier, open relay in the SMTP protocol is one of the most dangerous security flaws in e-mail systems. An open relay forwards messages for anyone, which makes the sender very difficult to trace, so that neither technical nor legal means work well. Therefore, closing the open relay function of the mail server is the foundation of the whole anti-spam system. In the early days, network infrastructure was poor and the sending and receiving servers often could not connect directly, so delivery had to go through open relays.
Now that network infrastructure is mature, sending and receiving servers can almost always connect directly, and a mail system no longer needs open relaying for normal operation.
In addition, turning off open relay is necessary to maintain the security of the mail server itself. If spammers use an open relay to spread spam, it inconveniences others and increases the server's workload. Besides wasting network resources, it may even result in the server being blacklisted by anti-spam organizations. Therefore, to maintain server security, open relaying on the mail server should be shut down.
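
What "closing" the relay means can be sketched as a simple policy check. The function and parameter names are hypothetical and real mail servers express this in their own configuration language; the sketch only assumes the common rule that a server accepts mail for its own domains from anyone, but forwards to foreign domains only for trusted or authenticated clients.

```python
def should_relay(client_ip, recipient, local_domains, trusted_ips, authenticated=False):
    """Decide whether to accept a message for delivery or forwarding."""
    domain = recipient.rsplit("@", 1)[-1].lower()
    if domain in local_domains:
        return True   # final delivery to one of our own users
    # Forwarding to a foreign domain is only allowed for known clients.
    return authenticated or client_ip in trusted_ips
```

An open relay corresponds to this function returning `True` unconditionally.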

A proxy is a network service that forwards user requests to other network servers. The user first connects to the proxy server and sends an access request; the proxy server then accesses the corresponding server and retrieves the resources that the user wants.
A proxy server enables users to reach resources that cannot be accessed directly, and it can also control or filter user access. The former is often used between two networks that cannot be connected directly; the latter is commonly used in network firewalls.

Typically a proxy server has some restrictions: for example, it can only access certain sites or services, or only authorized users can use it. An open proxy, by contrast, neither restricts the network resources that can be reached through it nor verifies the user’s identity. This is often caused by incorrect configuration. Like an open relay, an open proxy can be used to hide the sender's real identity, which is a major obstacle for anti-spam technology.
It also wastes the resources of the companies that deploy the proxy servers. At present, many sites running open proxies have also been blocked by anti-spam organizations.

4.5 Problems and future challenges about antispam technologies
The application of anti-spam technology may also cause social problems. For example, when an anti-spam blacklist blocks an entire ISP, communication may be interrupted over a large area, which inconveniences users and may lead to disputes.
In addition, anti-spam systems raise user privacy issues, which deserve equal attention.

Data stored on computers and leaked personal information are generally fragmentary rather than complete or systematic. People who collect and process such data may reassemble it with software, but the recombined result may differ greatly from the real information, or even be completely wrong, and using such data tends to harm the parties involved. Both content-based filtering and probabilistic tools such as Bayesian analysis require a complete inspection of the email content; that is, the user’s email must be read from beginning to end. Although this is not done in order to collect users’ private information, the process itself is an infringement of users’ privacy rights.

Many anti-spam systems move messages that cannot be classified as either spam or normal mail to a staging area, where the administrator handles them manually. In most cases the administrator decides whether a message is spam based on its content, which means the administrator must read the mail. This is also an infringement of users’ privacy rights.

From the industry perspective, on the one hand, users should be given the “right to be informed”, so that they understand the privacy problems and risks that anti-spam technology brings and can decide for themselves whether to adopt it.
On the other hand, the management and restriction of the technology should be strengthened: reduce unnecessary collection and retention of information, strengthen the protection of user information, and prevent security incidents. At the same time, human intervention in the system should be minimized.

Users should be aware that email is not a secure information service. Sending important information by email should therefore be avoided, especially vital information such as bank account numbers and passwords. If it must be sent, use trusted encryption to encrypt it before sending. In addition, users should recognize that anti-spam technology is necessary to limit the proliferation of spam and that it involves no deliberate infringement of personal privacy rights, so it can be used with reasonable confidence.

The future challenges of spam filtering technology:
More and more spam is designed to bypass spam filters, which forces spam filtering technology to be updated constantly.
Laws and regulations against spam need to be introduced at the same time; in my view, combining the two is the most complete way to block malicious junk mail.

Spam filtering technology has improved greatly with the continuing development of science and technology. However, the classification error rate is still high, which causes trouble for some users, so anti-spam technology remains an active topic. The current filtering technologies need to be combined, and a multi-layer filtering approach should be adopted across the server side, the gateway and the client.
The mail server should avoid open relaying, consult blacklists when providing service, and support filtering based on keywords, source addresses and target addresses, while ensuring the stable and timely delivery of normal mail.

The gateway should use a hardware-based email filtering system. The equipment is placed between the routers and the servers, scans incoming email and tries to block spam before it enters the network. This not only preserves bandwidth but also reduces the burden on the server.

The client side is the most important part of the spam filtering process. To block spam as completely as possible, filtering must be strengthened on the client side, because it is the last line of defense against spam.
A future client email filter should support user personalization. It should capture new spam samples automatically, analyze them, and build a spam dataset tailored to the user. When filtering errors occur, it should allow them to be corrected automatically or manually. It should also manage spam effectively, including reading suspicious mail, deleting it, and forwarding it to a designated administrator.

Chapter 5 Bayesian mail filtering technology
The most popular techniques used to reduce spam nowadays include whitelisting and blacklisting, address management, collaborative filtering, digital signatures, etc. However, content-based filtering (and in particular Bayesian filtering) is the most widely used method and plays an important role in reducing spam. Each message is searched for spam features, such as indicative words (e.g. “free”), unusual distribution of punctuation marks and capital letters (e.g. “SALE!!!!!”), etc.

Content-based spam filters can be built manually, by hand-engineering the set of attributes that define spam messages. These are often called heuristic filters [31], and some popular filters like SpamAssassin have been based on this idea for years. Content-based filters can also be built by applying machine learning techniques to a set of pre-classified messages [32].

These so-called Bayesian filters are very accurate according to recent statistics [22]. Bayesian filters automatically [33] induce or learn a spam classifier from a set of manually classified examples of spam and legitimate (or ham) messages (the training collection). The learning process takes the training collection as input and consists of the following steps [34]:
• Pre-processing.
• Tokenization.
• Representation.
• Selection.
• Learning.
Each new target message is pre-processed, tokenized, represented and fed into the classifier, which then decides whether it is spam or not.
Current work on Bayesian filter development focuses on the first steps, given that the quality of the representation has a big impact on the accuracy of the learned model. It is noteworthy that some researchers have developed highly accurate filters by employing character-level tokenization, putting nearly all the intelligence of the filter in the learning method (a form of text compression) [35].
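
The first steps of this pipeline can be sketched as follows. This is a minimal illustration under simplifying assumptions: the function names are hypothetical, pre-processing only lower-cases the text and replaces HTML tags with spaces, tokens are runs of letters and digits, and "selection" is reduced to a fixed vocabulary supplied by the caller.

```python
import re
from collections import Counter


def preprocess(raw: str) -> str:
    """Replace anything that looks like an HTML tag with a space and lower-case the text."""
    return re.sub(r"<[^>]+>", " ", raw).lower()


def tokenize(text: str) -> list:
    """Split the text into word tokens (runs of letters and digits)."""
    return re.findall(r"[a-z0-9]+", text)


def represent(tokens, vocabulary):
    """Term-frequency vector over the selected vocabulary, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]
```

For example, `represent(tokenize(preprocess("<b>FREE</b> offer, free!!!")), ["free", "offer", "meeting"])` yields `[2, 1, 0]`. The resulting vector is what the learning step, or a trained classifier, takes as input.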

5.1 Introduction to Naive Bayesian mail filtering technology
Naïve Bayesian classification algorithms are often used in spam filtering. Experiments by Sahami (1998) at Stanford University, Paul Graham (2000) and David Mertz (2002) showed that the naïve Bayesian classification algorithm works well for spam filtering. The basic idea of Bayesian spam filtering is to extract characteristic words from the e-mail content and classify the e-mail according to how frequently those words appear in the different classes of e-mail.
The Bayesian approach identifies spam probabilistically. Rather than judging an email by a single feature, it calculates the probability of each feature appearing in junk mail, applies Bayes' theorem to compute the probability that the message is spam, and then decides whether the message is spam.

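From Bayes' theorem and the law of total probability, for a message $\vec{d}$ and a category $c$,

\[
P(C = c \mid \vec{d}) = \frac{P(C = c)\, P(\vec{d} \mid C = c)}{\sum_{c'} P(C = c')\, P(\vec{d} \mid C = c')}
\]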

A spam filter model based on Bayesian method
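Each mail sample is described mathematically in a vector space as $\vec{d} = (x_1, w_1; \ldots; x_n, w_n)$, where $x_i$ is a characteristic term selected for the mail and $w_i$ is the weight of $x_i$, whose value is 0 or 1. The task of the mail filter is to calculate the probabilities that the mail is spam and that it is normal mail.
The category of the mail being processed is $C \in \{\text{Spam}, \text{Ham}\}$, where Ham means legitimate email. According to Bayesian theory, the spam probability is calculated as follows:

\[
P(C = \text{Spam} \mid \vec{d}) = \frac{P(C = \text{Spam})\, P(x_1, \ldots, x_n \mid C = \text{Spam})}{\sum_{c \in \{\text{Spam}, \text{Ham}\}} P(C = c)\, P(x_1, \ldots, x_n \mid C = c)}
\]

When $P(C = \text{Spam} \mid \vec{d})$ is bigger than $P(C = \text{Ham} \mid \vec{d})$, or their quotient is greater than a specified threshold, the email is classified as spam. Here $P(C = \text{Spam})$ is the prior probability of spam in the sample, and $P(x_1, \ldots, x_n \mid C = \text{Spam})$ is the probability of all the features $(x_1, x_2, \ldots, x_n)$ occurring simultaneously in a spam message.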
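Since there are only two types of mail, junk mail and legitimate mail, using the naive Bayesian algorithm directly can produce a large classification bias. To avoid misjudgment, an email is classified as spam only when the following condition is met:

\[
\frac{P(C = \text{Spam} \mid \vec{d})}{P(C = \text{Ham} \mid \vec{d})} > \lambda
\]

where $P(C = \text{Spam} \mid \vec{d}) = 1 - P(C = \text{Ham} \mid \vec{d})$, so the condition is equivalent to $P(C = \text{Spam} \mid \vec{d}) > t$ with $t = \lambda / (1 + \lambda)$ and $\lambda = t / (1 - t)$. Sahami et al. set the threshold $t$ to 0.999 ($\lambda = 999$); i.e. blocking a legitimate message is considered as bad as letting 999 junk mails pass the filter.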
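
The relation between the cost ratio λ and the probability threshold t can be checked numerically. The sketch below assumes the usual relation t = λ / (1 + λ), which is consistent with the pair t = 0.999, λ = 999 quoted above; the function names are hypothetical.

```python
def threshold_from_lambda(lam: float) -> float:
    """Probability threshold t equivalent to the ratio test P(spam)/P(ham) > lambda."""
    return lam / (1.0 + lam)


def lambda_from_threshold(t: float) -> float:
    """Cost ratio lambda equivalent to the probability threshold t (requires t < 1)."""
    return t / (1.0 - t)


def is_spam(p_spam: float, lam: float = 999.0) -> bool:
    """Classify as spam only if P(spam | d) exceeds the threshold implied by lambda."""
    return p_spam > threshold_from_lambda(lam)
```

With λ = 1 the rule reduces to a plain comparison of the two posteriors (t = 0.5); with λ = 999 a message estimated at 99% spam is still delivered.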

5.2 Multinomial term frequency Naive Bayes
This method takes term frequency into account. Each occurrence of a feature is regarded as an event, and the mail sample is viewed as a collection of these separate events. $P(d_x \mid C_j)$ follows a multinomial distribution and is the probability of all the features $(x_1, x_2, \ldots, x_n)$ occurring simultaneously in mail of a known category.
Let $|d|$ be the total number of words in text $d_x$ and $C \in \{\text{Spam}, \text{Ham}\}$ the category of the mail. $|d|$ is assumed not to be affected by the category $C$. This is a simplifying assumption, because the probability of receiving a long normal email is actually much higher than the probability of receiving spam of the same length. Let $m$ be the number of characteristic words and $N_{t_i}$ the total number of times $t_i$ appears in the mail sample $d_x$. Then:

\[
P(d_x \mid C) = P(|d|)\, |d|! \prod_{i=1}^{m} \frac{P(t_i \mid C)^{N_{t_i}}}{N_{t_i}!}
\]

In the spam filtering situation, we get:

\[
P(d_x \mid C = \text{Spam}) = P(|d|)\, |d|! \prod_{i=1}^{m} \frac{P(t_i \mid C = \text{Spam})^{N_{t_i}}}{N_{t_i}!}
\]

To avoid a zero probability or a zero denominator, Laplace smoothing is applied when estimating the conditional probabilities:

\[
P(t_i \mid C = \text{Spam}) = \frac{1 + M_{t_i,s}}{k + M_s}
\]

where $M_s$ is the total number of characteristic words in all spam text and $M_{t_i,s}$ is the number of times the characteristic term $t_i$ appears in all spam text. For binary classification, the value of $k$ is 2.

$P(d_x \mid C = \text{Ham})$ and $P(t_i \mid C = \text{Ham})$ are obtained in the same way.

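According to the total probability formula,

\[
P(C = \text{Spam} \mid d_x) = \frac{P(C = \text{Spam})\, P(d_x \mid C = \text{Spam})}{P(d_x)}
\]

where

\[
P(d_x) = P(C = \text{Spam})\, P(d_x \mid C = \text{Spam}) + P(C = \text{Ham})\, P(d_x \mid C = \text{Ham})
\]

$P(C = \text{Ham} \mid d_x)$ is obtained in the same way. When $P(C = \text{Spam} \mid d_x)$ is bigger than $P(C = \text{Ham} \mid d_x)$, or their quotient is greater than the specified threshold, the email is classified as spam.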
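
A term-frequency multinomial classifier of this kind can be sketched in a few lines. The class below is an illustration, not the implementation used in this work. It makes three assumptions: both classes have been trained at least once, the factors that depend only on message length are dropped because they are the same for both classes and cancel in the posterior, and smoothing is add-one with the vocabulary size in the denominator, which is the common textbook choice. Sums of logarithms are used instead of products to avoid numerical underflow.

```python
import math
from collections import Counter


class MultinomialNB:
    """Term-frequency naive Bayes for two classes, 'spam' and 'ham' (illustrative)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, tokens, label):
        """Add one labelled message; every token occurrence counts."""
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1

    def log_score(self, tokens, label):
        """log P(C) + sum of log P(t | C) over all token occurrences."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total = sum(self.word_counts[label].values())
        prior = self.doc_counts[label] / sum(self.doc_counts.values())
        score = math.log(prior)
        for t in tokens:
            # Add-one smoothing; vocabulary size in the denominator is an assumption.
            p = (self.word_counts[label][t] + 1) / (total + len(vocab))
            score += math.log(p)
        return score

    def p_spam(self, tokens):
        """Posterior P(spam | tokens), normalised over the two classes."""
        s = self.log_score(tokens, "spam")
        h = self.log_score(tokens, "ham")
        m = max(s, h)                       # subtract the max before exponentiating
        es, eh = math.exp(s - m), math.exp(h - m)
        return es / (es + eh)
```

After training on one spam message `["free", "money", "free"]` and one ham message `["meeting", "tomorrow"]`, `p_spam(["free"])` evaluates to 0.72 and `p_spam(["meeting"])` to 0.3.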

5.3 Multinomial Boolean Naive Bayes
This method is similar to the multinomial term-frequency Naive Bayes method: both of them contain the conditional probability $P(t_i \mid C)$, but in this model the attribute values are Boolean. The calculation of conditional probability is not needed in this model; in this case:

The use of Laplace smoothing is also different in this model.
The Bayesian algorithm based on Boolean attributes performs better than the Bayesian method based on word frequency when it is given less information about feature correlation [36].
It has been shown that the Bayesian method based on word-frequency attributes is equivalent to a Bayesian method based on Poisson-distributed attributes, under the assumption that mail length is independent of mail type. Therefore, when the word-frequency attributes do not follow a Poisson distribution, the Bayesian method based on Boolean attributes performs better than the one based on word-frequency attributes.

5.4 Multivariate Bernoulli Naive Bayes
Suppose the feature vector of an email is $F = (t_1, \ldots, t_m)$, where $t_i$ is a mark given to each feature. This method treats each mail sample as a de-duplicated collection of marks $X = (t_1, \ldots, t_m)$, in which each $t_i$ is a binary variable with value 0 or 1; the value 0 means that feature $t_i$ is not present in sample $d$.
Deciding that a mail sample $d$ belongs to category $C$ can be regarded as the result of $m$ Bernoulli trials.
For the probability $P(t_i \mid c)$ of Bernoulli trial $i$, the multivariate Bernoulli Bayesian algorithm makes an additional assumption: the results of the Bernoulli trials are independent of each other.
This is an idealized assumption, because word occurrences within a category are not actually independent of each other.
Similar assumptions are made in all naive Bayesian models. Although these assumptions are too simple in most cases, these Bayesian algorithms still perform very well in many classification tasks. Under this assumption:

\[
P(X \mid c) = \prod_{i=1}^{m} P(t_i \mid c)^{t_i} \left(1 - P(t_i \mid c)\right)^{1 - t_i}
\]

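Laplace smoothing is also needed here:

\[
P(t_i \mid c) = \frac{1 + M_{t_i,c}}{k + M_c}
\]

where $M_{t_i,c}$ is the number of texts belonging to category $c$ that contain feature $t_i$, $M_c$ is the total number of texts in category $c$, and $k$ is the number of categories.
For a binary variable, the value of $k$ is 2.
To prevent underflow, probabilities are generally handled in logarithmic form in practice. With the classification threshold $T = 0$, an email is classified as spam when:

\[
\log \frac{P(\text{Spam})}{P(\text{Ham})} + \sum_{i=1}^{m} \left[ t_i \log \frac{P(t_i \mid \text{Spam})}{P(t_i \mid \text{Ham})} + (1 - t_i) \log \frac{1 - P(t_i \mid \text{Spam})}{1 - P(t_i \mid \text{Ham})} \right] > T
\]

A more theoretical explanation can be found in the studies of Metsis et al. and of Losada and Azzopardi [37].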
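
The log-space decision for the multivariate Bernoulli model can be sketched as below. The function names are hypothetical, and the estimate (1 + M_tc) / (2 + M_c), where M_c is the number of training messages in the class, is the common add-one form assumed here for a binary feature. Unlike the multinomial models, absent features (value 0) also contribute to the score.

```python
import math


def estimate(docs_with_feature: int, docs_in_class: int) -> float:
    """Laplace-smoothed P(t | c) = (1 + M_tc) / (2 + M_c) for a binary feature."""
    return (1 + docs_with_feature) / (2 + docs_in_class)


def bernoulli_log_odds(present, prior_spam, p_t_spam, p_t_ham):
    """log P(spam | d) - log P(ham | d) for a 0/1 feature vector `present`."""
    score = math.log(prior_spam) - math.log(1 - prior_spam)
    for x, ps, ph in zip(present, p_t_spam, p_t_ham):
        if x:
            score += math.log(ps) - math.log(ph)            # feature present
        else:
            score += math.log(1 - ps) - math.log(1 - ph)    # feature absent
    return score


def classify(present, prior_spam, p_t_spam, p_t_ham, threshold=0.0):
    """Spam if the log-odds exceed the threshold (T = 0 compares the two posteriors)."""
    return bernoulli_log_odds(present, prior_spam, p_t_spam, p_t_ham) > threshold
```

Raising `threshold` above 0 makes the filter more conservative about blocking mail.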

5.5 The advantages and limitations of Bayesian methods
Advantages:
(1) The Bayesian classification method learns incrementally as each message is received, so it can adapt to the evolution of spam. No matter how much the content of spam changes, the method can, under the guidance of users, collect the characteristics of recently received spam and filter it effectively [38][39].
(2) The Bayesian classification algorithm only needs to store word counts rather than the actual mail. As a result, less storage is needed, and the resulting data can be shared between users without exposing the mail itself.
(3) It is more efficient than other algorithms. The Bayes classification algorithm scans the training samples once to count how often each word appears in normal mail and in spam; at classification time it only needs to look up each token once and then multiply or add the token values. By contrast, SVM, boosting and genetic algorithms require scanning the training samples many times.
(4) The Bayesian classification method is suitable for personal filters.
Each user can customize the filter to make it more effective, adjust the filtering precision, and define the feature selection.
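
Points (1) and (2) can be illustrated with a small count store. This is a sketch with hypothetical names: it keeps only per-class token counts (one count per message in which a token occurs), learns one message at a time, and lets the user correct a misclassification by moving that message's counts to the other class.

```python
from collections import Counter


class TokenStore:
    """Per-class token counts; the mail itself is never stored."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def learn(self, tokens, label):
        """Incremental update from a single message (each token counted once)."""
        self.counts[label].update(set(tokens))
        self.messages[label] += 1

    def relabel(self, tokens, old, new):
        """User correction: move this message's counts from `old` to `new`."""
        self.counts[old].subtract(set(tokens))
        self.counts[old] = +self.counts[old]   # unary plus drops zero and negative entries
        self.messages[old] -= 1
        self.learn(tokens, new)
```

Because each update touches only the tokens of one message, training cost grows linearly with the mail received, and the stored counts can be exported without revealing message text.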

Limitations:
(1) Emails collected for academic research cannot represent general email. Mail samples from different channels, and even from different times of the day, can differ considerably.
For example, because the legitimate messages in Ling-Spam come from discussions among linguists, linguistic terms account for a large proportion of its feature vectors.
Time is also important: junk mail collected around the New Year may contain the phrase “Happy New Year”, which could turn “Happy New Year” into a spam feature.

(2) Current research on Bayesian methods for filtering email focuses on English email. The applicability of the method to other languages remains to be studied.
In English, words with independent meanings are naturally separated by spaces and are easy to identify for statistics. In some languages, such as Chinese, there are no natural separators between words, so Bayesian classification cannot be applied directly in this case.

(3) As mentioned before, as spam filtering technology develops, spammers constantly create spam that can bypass the filter. The essence of the Bayesian method is to distinguish spam based on the statistics of the features in the mail. Spammers may therefore fill their spam with terms that are usual in normal mail, that is, reduce the frequency of spam words and increase the frequency of words common in normal email in order to deceive the filter.
This makes it difficult for Bayesian methods to identify the message. In addition, splitting a spam keyword with HTML tags can also bypass filtering.
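
The HTML trick can be demonstrated with two tokenizers. This is an illustrative sketch with hypothetical function names: a tokenizer that works on the raw markup never sees the split keyword, while one that removes tags first (without inserting a space) recovers it. The regular expression is a simplification and not a full HTML parser.

```python
import re


def naive_tokens(html: str):
    """Tokenize the raw markup: tags break words apart and become tokens themselves."""
    return re.findall(r"[a-z]+", html.lower())


def tag_stripped_tokens(html: str):
    """Remove tags first, joining the text on either side, then tokenize."""
    text = re.sub(r"<[^>]*>", "", html)
    return re.findall(r"[a-z]+", text.lower())
```

For the input `"Buy vi<b></b>agra now"`, the naive tokenizer produces `vi` and `agra` as separate tokens, whereas the tag-stripping version produces `["buy", "viagra", "now"]`.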

5.7 The future of spam
In addition to methods such as closing open relays, spam filtering and establishing relevant laws, other methods and technologies are under discussion:
(1) Opt-out lists. A database is established to which people who do not want to receive spam can add their email addresses. The database maintainers require spammers to remove these people from their mailing lists. In 1997, several such databases were established around the world. However, the results were poor, and people even suspected that these databases were maintained by spammers who wanted to collect email addresses.

(2) Channels. Mail from business contacts and known addresses is placed in a special mailbox, while mail from unknown senders is rejected or put in a separate mailbox. An AT&T researcher developed such a system: users give each contact a unique e-mail address to keep the communication channels independent, so a user can discard one channel if necessary without affecting the others. A similar system is said to have been installed in Lucent Personal Web Assistant.

(3) Payments. It has been proposed to establish a system in which recipients can demand payment from senders for reading their emails, with the payments made in electronic currency. This raises the cost of sending spam, so spam may be controlled effectively.

(4) Fee restructuring. One solution under discussion is to restructure the current fee structure for online services: an ISP that sends a lot of mail pays a fee to the ISP that receives it. This would push sending ISPs to charge, or otherwise control, users who use their networks to send large amounts of spam.

However, each approach has its own advantages and disadvantages, and there is no perfect solution at the moment. Addressing the spam problem will take a lot of effort. Spam is not only a technical problem but also a social one, because it involves problems such as stealing other people’s resources and spreading junk information across the whole Internet.
Major ISPs should do a good job of security management and take a responsible attitude to network security events (not only spam), responding positively to complaints instead of answering “It is not under our control…”.

Education, training and publicity are also crucial. In computer network security there is still a considerable gap between developing and developed countries, and an important reason is insufficient network security awareness in some countries. Problems such as open relay are no longer technically difficult today; the level of security depends largely on domestic cybersecurity management, good supervision and thorough rectification mechanisms.
Other vulnerabilities, such as enabled EXPN and VRFY commands, are also quite common. The skill level of network administrators varies from place to place, so education and training in network security may be needed. In addition, it is necessary to strengthen publicity, spread knowledge about spam, and use the pressure of public opinion to make vendors standardize their behavior.

When the legal mechanism for privacy protection is imperfect, the only way to address the privacy problem is to rely on industry self-discipline and on improving users’ awareness of self-protection.

Conclusion
Spam is currently still mostly in text form. Many filtering techniques are based on text categorization methods, and no technique can claim to provide an ideal solution with 0% false positives and 0% false negatives [40]. This article has introduced spam filtering technology and analyzed the relationship and differences between mail filtering and text classification. The use of the traditional Bayesian algorithm in spam filtering has been analyzed in detail, including two models of the Bayesian algorithm. After studying several spam filtering technologies, it is clear that spam filtering is a long-term struggle: while we analyze and study spam, spammers work to design e-mails that can interfere with the spam filter. On the one hand it is important to recover the real content of spam; on the other hand, spam filters need to be continually updated to accommodate new forms of spam.