Information

Simplified molecular-input line-entry system


This may be an abrupt or vague question, but I didn't know how to search for the answer.
Which properties of a 3D biomolecule are ignored when it is converted to a 1D representation in SMILES (simplified molecular-input line-entry system)? Two kinds of property I found that get neglected are linkages and interaction properties. By properties I mean isomeric count, isomeric threshold, etc. Are there any more properties that get neglected?


A 3D model of a (bio)molecule represents a physical structure in three dimensions. For an experimental structure, each atom has a 3D coordinate (x, y, z) and, if determined by crystallography, an additional isotropic or anisotropic B-factor (which models atomic fluctuations).

A '1D' SMILES is not a physical one-dimensional representation; it can be converted to a graph (in the mathematical sense) representing the chemical structure. The SMILES specification can retain information about structural isomers and stereoisomers (cis/trans; D/L), but not about conformers/rotamers, which, unlike those of smaller chemical structures, are restricted to certain favorable regions for larger biomolecules.

If by "linkage" you mean a chemical linkage such as a disulfide bond, then SMILES can retain that information. If by "interactions" you mean hydrogen bonding, salt bridges and even hydrophobic interactions, then SMILES does not encode such information, but neither does a 3D model explicitly -- they are implied by the 3D coordinates of each atom.

Thus, in converting from a 3D model to SMILES, you would lose:

  1. Unusual bond lengths and bond angles: they are necessarily specified by the 3D coordinates, but in SMILES (or the subsequent graph) they are taken as 'ideal'.

  2. B-factors (temperature factors).

  3. Rotamers: e.g. proteins have favored dihedral angles in both backbone and side chains, which are not specified in SMILES.

  4. "Interactions": an intramolecular interaction that is inevitable (e.g. in salicylaldehyde) is implied in both SMILES and a 3D model. Most others can only be inferred from the 3D model.
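To make item 3 concrete, here is a minimal Python sketch of a dihedral (torsion) angle computed from four atomic coordinates. The coordinates are idealized toy values, not a real butane geometry, and the helper names `dihedral` and `four_atom_chain` are hypothetical. Both conformers below would share the same SMILES string, so the angle exists only in the 3D model.

```python
import math

def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees for four points given as (x, y, z)."""
    sub = lambda a, b: tuple(x - y for x, y in zip(a, b))
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    cross = lambda a, b: (a[1] * b[2] - a[2] * b[1],
                          a[2] * b[0] - a[0] * b[2],
                          a[0] * b[1] - a[1] * b[0])
    b0, b1, b2 = sub(p0, p1), sub(p2, p1), sub(p3, p2)
    length = math.sqrt(dot(b1, b1))
    b1 = tuple(x / length for x in b1)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = sub(b0, tuple(dot(b0, b1) * x for x in b1))
    w = sub(b2, tuple(dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(dot(cross(b1, v), w), dot(v, w)))

def four_atom_chain(angle_deg):
    """Toy coordinates (unit bond offsets, assumed) with a chosen torsion angle."""
    phi = math.radians(angle_deg)
    return [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0),
            (math.cos(phi), math.sin(phi), 1.0)]

anti = four_atom_chain(180.0)
gauche = four_atom_chain(60.0)
SMILES_FOR_BOTH = "CCCC"  # the line notation cannot tell the two conformers apart

print(round(abs(dihedral(*anti)), 1), round(dihedral(*gauche), 1))
```

The same idea extends to bond lengths and angles: they can be measured from coordinates but have no slot in the SMILES string.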


Simplified Molecular-input Line-entry System

The simplified molecular-input line-entry system or SMILES is a specification in form of a line notation for describing the structure of chemical molecules using short ASCII strings. SMILES strings can be imported by most molecule editors for conversion back into two-dimensional drawings or three-dimensional models of the molecules.

The original SMILES specification was developed by Arthur Weininger and David Weininger in the late 1980s. The Environmental Protection Agency funded the initial project to develop SMILES. It has since been modified and extended by others, most notably by Daylight Chemical Information Systems Inc. In 2007, an open standard called "OpenSMILES" was developed by the Blue Obelisk open-source chemistry community. Other 'linear' notations include the Wiswesser Line Notation (WLN), ROSDAL and SLN (Tripos Inc).

In July 2006, the IUPAC introduced the InChI as a standard for formula representation. SMILES is generally considered to have the advantage of being slightly more human-readable than InChI; it also has a wide base of software support with extensive theoretical (e.g., graph theory) backing.



Abstract

A simplified molecular input line entry system (SMILES) descriptor-based quantitative structure–activity relationship (QSAR) study was performed on a set of HIV-protease inhibitors to explore the structural functionalities for inhibition of the HIV-protease. For this purpose a set of HIV-inhibitors was collected from the literature along with their inhibitory constants. Monte Carlo optimization-based CORAL software was used for QSAR model development. First, the dataset was divided into three random splits; second, each split was divided into training, calibration, test and validation sets. The training set was used for model development whereas the rest of the sets were used to assess the quality of the developed models. QSAR models were developed with and without considering the influence of cyclic rings on the inhibitory activity. The statistical quality of the QSAR models developed from all splits was very good and fulfilled the criteria. The values of R², Q², s, R²pred and r²m indicated that the selected models are robust and efficient enough to predict the inhibitory activity of molecules outside of the training set. Statistical parameters also suggested that the presence of cyclic rings has a crucial impact on inhibitory activity. Molecular fragments important for increasing or decreasing the inhibitory activity were identified, which shows that the models have a mechanistic interpretation. This ligand-based QSAR study can provide clear directions to design and modulate potential HIV-protease inhibitors.


SMILES Tutorial

SMILES (Simplified Molecular Input Line Entry System) is a chemical notation that allows a user to represent a chemical structure in a way that can be used by the computer. SMILES is an easily learned and flexible notation. The SMILES notation requires that you learn a handful of rules. You do not need to worry about ambiguous representations because the software will automatically reorder your entry into a unique SMILES string when necessary.

SMILES was developed through funding from the U.S. Environmental Protection Agency, Mid-Continent Ecology Division-Duluth, (MED-Duluth) Duluth, MN to the Medicinal Chemistry Project at Pomona College, Claremont, CA and the Computer Sciences Corporation, Duluth, MN. Several publications discuss SMILES in more detail, including Anderson et al. 1987, Weininger 1988, Weininger et al. 1989, and Hunter et al., 1987.

SMILES has five basic syntax rules which must be observed. If basic rules of chemistry are not followed in SMILES entry, the system will warn the user and ask that the structure be edited or reentered. For example, if the user places too many bonds on an atom, a SMILES warning will appear that the structure is impossible. The rules are described below and some examples are provided. The rules below allow for the representation of a two-dimensional structure of a chemical. For the ASTER system, a two-dimensional depiction is adequate. Other rules are available for chemicals that are structural isomers, but will not be discussed in this basic tutorial.

Rule One: Atoms and Bonds

SMILES supports all elements in the periodic table. An atom is represented using its respective atomic symbol. Upper case letters refer to non-aromatic atoms; lower case letters refer to aromatic atoms. If the atomic symbol has more than one letter, the second letter must be lower case.

Bonds are denoted as shown below:

-    Single bond
=    Double bond
#    Triple bond
*    Aromatic bond
.    Disconnected structures

Single bonds are the default and therefore need not be entered. For example, 'CC' would mean that there is a non-aromatic carbon attached to another non-aromatic carbon by a single bond, and the computer would identify the structure as the chemical ethane. It is also assumed that the bond between two lower case atom symbols is aromatic. A blank terminates the SMILES string.

Rule Two: Simple Chains

By combining atomic symbols and bond symbols, simple chain structures can be represented. The structures that are entered using SMILES are hydrogen-suppressed, that is to say that the molecules are represented without hydrogens. The SMILES software understands the number of possible connections that an atom can have. If the user does not identify enough bonds through SMILES notation, the system will automatically assume that the remaining connections are satisfied by bonds to hydrogen.

SMILES    Formula    Name
CC        CH3CH3     Ethane
C=C       CH2CH2     Ethene
CBr       CH3Br      Bromomethane
C#N       HC≡N       Hydrocyanic acid
Na.Cl     NaCl       Sodium chloride

The user can explicitly identify the hydrogens, but if one hydrogen is identified in the string, the SMILES interpreter will assume that the user has identified all hydrogens for that molecule.
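As a rough illustration of hydrogen suppression, the following Python sketch fills in implicit hydrogens for unbranched, acyclic chains only. The valence table uses the lowest normal valence of each organic-subset element, which is a simplifying assumption, and the function name `implicit_hydrogens` is hypothetical rather than part of any SMILES software.

```python
# Lowest "normal" valences (an assumption of this sketch; real interpreters
# also consider higher valences for N, P and S).
NORMAL_VALENCE = {"B": 3, "C": 4, "N": 3, "O": 2, "P": 3, "S": 2,
                  "F": 1, "Cl": 1, "Br": 1, "I": 1}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}

def implicit_hydrogens(smiles):
    """Return [(symbol, implicit_h), ...] for a simple chain such as 'C=C'."""
    atoms, used = [], []
    pending = 1  # order of the bond to the next atom; single by default
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch in BOND_ORDER:
            pending = BOND_ORDER[ch]
            i += 1
            continue
        # Prefer a two-letter symbol (Cl, Br) over a one-letter one.
        symbol = smiles[i:i + 2] if smiles[i:i + 2] in NORMAL_VALENCE else ch
        if symbol not in NORMAL_VALENCE:
            raise ValueError(f"unsupported symbol at position {i}: {ch!r}")
        if atoms:
            used[-1] += pending          # bond counts against the previous atom
        used.append(pending if atoms else 0)
        atoms.append(symbol)
        pending = 1
        i += len(symbol)
    result = []
    for symbol, bonds in zip(atoms, used):
        hydrogens = NORMAL_VALENCE[symbol] - bonds
        if hydrogens < 0:
            raise ValueError(f"too many bonds on {symbol}")
        result.append((symbol, hydrogens))
    return result

print(implicit_hydrogens("CC"))    # ethane
print(implicit_hydrogens("C#N"))   # hydrocyanic acid
```

The `ValueError` for an over-bonded atom mirrors the warning the tutorial describes for chemically impossible structures.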

Because SMILES allows entry of all elements in the periodic table, and also utilizes hydrogen suppression, the user should be aware of two-letter element symbols that could be misinterpreted by the computer. For example, 'Sc' could be interpreted as a sulfur atom connected to an aromatic carbon by a single bond, or it could be the symbol for scandium. The SMILES interpreter gives priority to the interpretation of a single bond connecting a sulfur atom and an aromatic carbon. To identify scandium the user should enter [Sc].
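The priority rule for 'Sc' can be sketched with a small regular-expression tokenizer. This is only an assumed tokenization scheme for illustration (the name `tokenize` and the exact pattern are not taken from the SMILES software): bracketed atoms are matched first, then the two-letter halogens, then single-letter atoms.

```python
import re

# Order matters: brackets first, then Cl/Br, then one-letter atoms.
TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]|[-=#*.()]|\d")

def tokenize(smiles):
    """Split a SMILES string into atom, bond, branch and ring-digit tokens."""
    tokens, pos = [], 0
    while pos < len(smiles):
        match = TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(f"cannot tokenize at position {pos}: {smiles[pos]!r}")
        tokens.append(match.group())
        pos = match.end()
    return tokens

print(tokenize("Sc1ccccc1"))  # sulfur followed by an aromatic carbon
print(tokenize("[Sc]"))       # scandium, because of the brackets
```

Without brackets, 'S' and 'c' are always read as two separate atoms, which is why scandium has to be written as [Sc].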

Rule Three: Branches

A branch from a chain is specified by placing the SMILES symbol(s) for the branch between parentheses. The string in parentheses is placed directly after the symbol for the atom to which it is connected. If it is connected by a double or triple bond, the bond symbol immediately follows the left parenthesis. Some examples:

CC(O)C                  2-Propanol
CC(=O)C                 2-Propanone
CC(CC)C                 2-Methylbutane
CC(C)CC(=O)             3-Methylbutanal
c1c(N(=O)=O)cccc1       Nitrobenzene
CC(C)(C)CC              2,2-Dimethylbutane
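One common way to read branches is with a stack: an opening parenthesis remembers the current attachment atom and a closing parenthesis returns to it. The sketch below assumes one-letter atoms and no rings, and the function name `parse_branches` is made up for illustration.

```python
def parse_branches(smiles):
    """Return (atoms, bonds) for a SMILES made of one-letter atoms, '=', '#', '(' and ')'.

    Each bond is (from_index, to_index, order).
    """
    order = {"=": 2, "#": 3}
    atoms, bonds = [], []
    stack = []       # attachment atoms to return to when a branch closes
    prev = None      # index of the atom the next atom will bond to
    pending = 1      # bond order for the next bond; single by default
    for ch in smiles:
        if ch == "(":
            stack.append(prev)
        elif ch == ")":
            if not stack:
                raise ValueError("unmatched ')'")
            prev = stack.pop()
        elif ch in order:
            pending = order[ch]
        elif ch.isalpha():
            atoms.append(ch)
            index = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, index, pending))
            prev, pending = index, 1
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    if stack:
        raise ValueError("unclosed branch")
    return atoms, bonds

print(parse_branches("CC(=O)C"))     # 2-propanone
print(parse_branches("CC(C)(C)CC"))  # 2,2-dimethylbutane
```

In 2,2-dimethylbutane the second carbon ends up with four bonds, two of them through consecutive branches.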

Rule Four: Rings

SMILES allows a user to identify ring structures by using numbers to identify the opening and closing ring atom. For example, in C1CCCCC1, the first carbon has a number '1' which connects by a single bond with the last carbon which also has a number '1'. The resulting structure is cyclohexane. Chemicals that have multiple rings may be identified by using different numbers for each ring. If a double, single, or aromatic bond is used for the ring closure, the bond symbol is placed before the ring closure number. Some examples:

C=1CCCCC1                       Cyclohexene
C*1*C*C*C*C*C1 or c1ccccc1      Benzene
C1OC1CC                         Ethyloxirane
c1cc2ccccc2cc1                  Naphthalene
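Ring closure digits can be resolved by remembering where each digit was opened and adding a bond when it appears again. The following sketch assumes one-letter atoms and no branches, uses the tutorial's '*' for aromatic bonds, and the name `parse_rings` is hypothetical.

```python
BOND_SYMBOLS = "-=#*"   # '*' is the aromatic bond symbol used in this tutorial

def parse_rings(smiles):
    """Return (atoms, bonds); each bond is (from_index, to_index, symbol)."""
    atoms, bonds = [], []
    open_rings = {}     # ring digit -> (atom index, bond symbol or None)
    pending = None      # explicit bond symbol waiting for an atom or ring digit

    def default_bond(a, b):
        # Two lower-case (aromatic) atoms are joined by an aromatic bond.
        return "*" if atoms[a].islower() and atoms[b].islower() else "-"

    for ch in smiles:
        if ch in BOND_SYMBOLS:
            pending = ch
        elif ch.isdigit():
            if not atoms:
                raise ValueError("ring digit before any atom")
            here = len(atoms) - 1
            if ch in open_rings:
                start, opened_with = open_rings.pop(ch)  # the digit can be reused
                symbol = pending or opened_with or default_bond(start, here)
                bonds.append((start, here, symbol))
            else:
                open_rings[ch] = (here, pending)
            pending = None
        elif ch.isalpha():
            atoms.append(ch)
            here = len(atoms) - 1
            if here > 0:
                bonds.append((here - 1, here, pending or default_bond(here - 1, here)))
            pending = None
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    if open_rings:
        raise ValueError(f"unclosed ring digits: {sorted(open_rings)}")
    return atoms, bonds

print(parse_rings("C=1CCCCC1")[1])   # cyclohexene: the ring closure bond is '='
```

With this reading, the two benzene notations in the table give the same list of bonds.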

Rule Five: Charged Atoms

Charges on an atom can be used to override the knowledge regarding valence that is built into SMILES software. The format for identifying a charged atom consists of the atom followed by brackets which enclose the charge on the atom. The number of charges may be explicitly stated (<-1>) or not (<->). For example:


Anderson, E., G.D. Veith, and D. Weininger. 1987. SMILES: A line notation and computerized interpreter for chemical structures. Report No. EPA/600/M-87/021. U.S. Environmental Protection Agency, Environmental Research Laboratory-Duluth, Duluth, MN 55804

Hunter, R.S., F.D. Culver, and A. Fitzgerald. 1987. SMILES User Manual. A Simplified Molecular Input Line Entry System. Includes extended SMILES for defining fragments. Review Draft, Internal Report, Montana State University, Institute for Biological and Chemical Process Control (IPA), Bozeman, MT.

Weininger, D. 1988. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences 28: 31-36.


Research output : Contribution to journal › Article › peer-review

Simplified molecular input line entry system-based optimal descriptors: Quantitative structure-activity relationship modeling mutagenicity of nitrated polycyclic aromatic hydrocarbons

Abstract: We developed a new QSAR model, based on optimal descriptors calculated with the simplified molecular input line entry system. These descriptors are correlated with mutagenic potential for a training set and correlated with this end-point for a test set. Statistical characteristics of the model are n = 28, r² = 0.902, q² = 0.892, s = 0.554, F = 240 (training set) and n = 20, r² = 0.853, q² = 0.823, s = 0.702, F = 105 (test set).
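The reported F values can be checked against n and r² with the usual regression F ratio. The sketch below assumes an ordinary least-squares model with a single descriptor (k = 1); the abstract does not state this explicitly, so treat it as an assumption. The function name `f_statistic` is illustrative.

```python
def f_statistic(r2, n, k=1):
    """Fisher F ratio for a linear model with k descriptors fitted to n compounds."""
    return (r2 / (1.0 - r2)) * (n - k - 1) / k

# Values taken from the abstract; small differences from the reported F
# are expected because r2 is rounded to three decimals.
print(round(f_statistic(0.902, 28), 1))  # training set, reported F = 240
print(round(f_statistic(0.853, 20), 1))  # test set, reported F = 105
```

Under that one-descriptor assumption, both reported F values are reproduced to within rounding of r².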


Definition of SMILES as strings of a context-free language

From the standpoint of formal language theory, a SMILES is a word. A SMILES can be parsed with a context-free parser. This representation has been used to predict biochemical properties (including toxicity and biodegradability), based on the main principle of cheminformatics that similar molecules have similar properties. The predictive models implemented a syntactic pattern recognition approach (which involved defining a molecular distance) as well as a more robust scheme based on statistical pattern recognition.


Simplified molecular-input line-entry system - Biology

Beam - to express by means of a radiant smile

Beam is a free toolkit dedicated to parsing and generating Simplified molecular-input line-entry system - SMILES™ line notations. The primary focus of the library is to handle the SMILES™ syntax elegantly and as fast as possible.

Note: Beam is still in development and some APIs will likely change until a release is made.

One of the primary types in Beam is the Graph; it provides convenience methods for reading SMILES™ notation directly

and for writing it back to SMILES™ notation.

Beam provides excellent round tripping, preserving exactly how the input was specified. Disregarding inputs with redundant brackets and erroneous/repeated ring numbers, the actual input will generally be identical to the output.

Although preserving the representation was one of the design goals for Beam, it is common to normalise output SMILES™.

Collapse a graph with labelled hydrogens [CH3][CH2][OH] to one with implicit hydrogens CCO .

Expand a graph where the hydrogens are implicit CCO to one with labelled hydrogens [CH3][CH2][OH] .

Stereo specification is preserved through rearrangements. The example below randomly generates arbitrary SMILES™ preserving correct stereo-configuration.

Bond based double-bond configuration is normal in SMILES but can be problematic. The issue is that a single symbol may be specifying two adjacent configurations. A proposed extension was to use atom-based double-bond configuration.

Beam will input, output and convert atom and bond-based double-bond stereo specification.

Convert a graph with delocalised bonds to kekulé representation.

With bond-based double-bond stereo specification there are two possible ways to write each bond-based configuration. beam allows you to normalise the labels such that the first symbol is always a forward slash ( / ). Some examples are shown below.
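The idea behind this normalisation can be sketched in Python (this is not Beam's Java API, and the function name is hypothetical). The sketch relies on the fact that swapping every '/' with '\' in a SMILES leaves each double-bond configuration unchanged, and it flips the whole string at once, which is a simplification of per-bond normalisation.

```python
def normalise_directional_bonds(smiles):
    """Flip all '/' and '\\' labels if the first directional label is a backslash."""
    swap = str.maketrans("/\\", "\\/")
    for ch in smiles:
        if ch == "/":
            return smiles                  # already starts with a forward slash
        if ch == "\\":
            return smiles.translate(swap)  # flip every label consistently
    return smiles                          # no directional bonds at all

print(normalise_directional_bonds(r"F\C=C\F"))  # trans, rewritten with '/'
print(normalise_directional_bonds(r"F\C=C/F"))  # cis, rewritten to start with '/'
```

Because both labels on a double bond are flipped together, trans stays trans and cis stays cis.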

Beam is still in development but you can obtain the latest build from the EBI snapshots repository. An example configuration for Maven is shown below.

Copyright (c) 2013, European Bioinformatics Institute (EMBL-EBI) All rights reserved.



Atoms

Atoms are represented by the standard abbreviation of the chemical elements, in square brackets, such as [Au] for gold. The hydroxide anion is [OH-]. Brackets can be omitted for the "organic subset" of B, C, N, O, P, S, F, Cl, Br, and I. All other elements must be enclosed in brackets. If the brackets are omitted, the proper number of implicit hydrogen atoms is assumed; for instance, the SMILES for water is simply O.

Bonds

Bonds between aliphatic atoms are assumed to be single unless specified otherwise and are implied by adjacency in the SMILES. For example the SMILES for ethanol can be written as CCO. Ring closure labels are used to indicate connectivity between non-adjacent atoms in the SMILES, which for cyclohexane and dioxane can be written as C1CCCCC1 and O1CCOCC1 respectively. Double and triple bonds are represented by the symbols '=' and '#' respectively as illustrated by the SMILES O=C=O (carbon dioxide) and C#N (hydrogen cyanide).

Branching

Branches are described with parentheses, as in CCC(=O)O for propionic acid and C(F)(F)F for fluoroform. Substituted rings can be written with the branching point in the ring, as illustrated by the SMILES COc(c1)cccc1C#N and COc(cc1)ccc1C#N, which encode the 3- and 4-cyanoanisole isomers. Writing SMILES for substituted rings in this way can make them more human-readable.

Aromaticity

Aromatic C, O, S and N atoms are shown in their lower case 'c', 'o', 's' and 'n' respectively. Benzene, pyridine and furan can be represented respectively by the SMILES c1ccccc1, n1ccccc1 and o1cccc1. Bonds between aromatic atoms are, by default, aromatic although these can be specified explicitly using the ':' symbol. Aromatic atoms can be singly bonded to each other and biphenyl can be represented by c1ccccc1-c2ccccc2. Aromatic nitrogen bonded to hydrogen, as found in pyrrole must be represented as [nH] and imidazole is written in SMILES notation as n1c[nH]cc1.

The Daylight and OpenEye algorithms for generating canonical SMILES differ in their treatment of aromaticity.

Stereochemistry

Configuration around double bonds is specified using the characters "/" and "\". For example, F/C=C/F is one representation of trans-difluoroethene, in which the fluorine atoms are on opposite sides of the double bond, whereas F/C=C\F is one possible representation of cis-difluoroethene, in which the Fs are on the same side of the double bond.

Configuration at tetrahedral carbon is specified by @ or @@. L-Alanine, the more common enantiomer of the amino acid alanine, can be written as N[C@@H](C)C(=O)O. The @@ specifier indicates that, when viewed from nitrogen along the bond to the chiral center, the sequence of substituents hydrogen (H), methyl (C) and carboxylate (C(=O)O) appears clockwise. D-Alanine can be written as N[C@H](C)C(=O)O. The order of the substituents in the SMILES string is very important, and D-alanine can also be encoded as N[C@@H](C(=O)O)C.

Isotopes

Isotopes are specified with a number equal to the integer isotopic mass preceding the atomic symbol. Benzene in which one atom is carbon-14 is written as [14c]1ccccc1 and deuterochloroform is [2H]C(Cl)(Cl)Cl.

Other examples of SMILES

The SMILES notation is described extensively in the SMILES theory manual provided by Daylight Chemical Information Systems and a number of illustrative examples are presented. Daylight's depict utility provides users with the means to check their own examples of SMILES and is a valuable educational tool.



SMILES (Simplified Molecular Input Line Entry System) is a line notation (a typographical method using printable characters) for entering and representing molecules and reactions. Some examples are:

SMILES        Name                SMILES               Name
CC            ethane              [OH3+]               hydronium ion
O=C=O         carbon dioxide      [2H]O[2H]            deuterium oxide
C#N           hydrogen cyanide    [235U]               uranium-235
CCN(CC)CC     triethylamine       F/C=C/F              E-difluoroethene
CC(=O)O       acetic acid         F/C=C\F              Z-difluoroethene
C1CCCCC1      cyclohexane         N[C@@H](C)C(=O)O     L-alanine
c1ccccc1      benzene             N[C@H](C)C(=O)O      D-alanine

Reaction SMILES                             Name
[I-].[Na+].C=CCBr>>[Na+].[Br-].C=CCI        displacement reaction
(C(=O)O).(OCC)>>(C(=O)OCC).(O)              intermolecular esterification

SMILES contains the same information as might be found in an extended connection table. The primary reason SMILES is more useful than a connection table is that it is a linguistic construct, rather than a computer data structure. SMILES is a true language, albeit with a simple vocabulary (atom and bond symbols) and only a few grammar rules. SMILES representations of structure can in turn be used as "words" in the vocabulary of other languages designed for storage of chemical information (information about chemicals) and chemical intelligence (information about chemistry).

Part of the power of SMILES is that unique SMILES exist. With standard SMILES, the name of a molecule is synonymous with its structure; with unique SMILES, the name is universal. Anyone in the world who uses unique SMILES to name a molecule will choose the exact same name.

One other important property of SMILES is that it is quite compact compared to most other methods of representing structure. A typical SMILES will take 50% to 70% less space than an equivalent connection table, even binary connection tables. For example, a database of 23,137 structures, with an average of 20 atoms per structure, uses only 1.6 bytes per atom when represented with SMILES. In addition, ordinary compression of SMILES is extremely effective. The same database cited above was reduced to 27% of its original size by Ziv-Lempel compression (i.e. 0.42 bytes per atom).

Examples of uses for SMILES are:

  • Keys for database access
  • Mechanism for researchers to exchange chemical information
  • Entry system for chemical data
  • Part of languages for artificial intelligence or expert systems in chemistry

3.1 Canonicalization

Input SMILES         Unique SMILES
OCC                  CCO
[CH3][CH2][OH]       CCO
C-C-O                CCO
C(O)C                CCO
OC(=O)C(Br)(Cl)N     NC(Cl)(Br)C(=O)O
ClC(Br)(N)C(=O)O     NC(Cl)(Br)C(=O)O
O=C(O)C(N)(Br)Cl     NC(Cl)(Br)C(=O)O

3.2 SMILES Specification Rules

There are five generic SMILES encoding rules, corresponding to specification of atoms, bonds, branches, ring closures, and disconnections. Rules for specifying various kinds of isomerism are discussed in the following section, ISOMERIC SMILES.

3.2.1 Atoms

C methane (CH4)
P phosphine (PH3)
N ammonia (NH3)
S hydrogen sulfide (H2S)
O water (H2O)
Cl hydrochloric acid (HCl)

Atoms with valences other than "normal" and elements not in the "organic subset" must be described in brackets.

[S] elemental sulfur
[Au] elemental gold

Within brackets, any attached hydrogens and formal charges must always be specified. The number of attached hydrogens is shown by the symbol H followed by an optional digit. Similarly, a formal charge is shown by one of the symbols + or -, followed by an optional digit. If unspecified, the number of attached hydrogens and charge are assumed to be zero for an atom inside brackets. Constructions of the form [Fe+++] are synonymous with the form [Fe+3]. Examples are:

[H+] proton
[Fe+2] iron (II) cation
[OH-] hydroxyl anion
[Fe++] iron (II) cation
[OH3+] hydronium cation
[NH4+] ammonium cation
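The bracket-atom rules above (optional isotope, symbol, optional chirality mark, hydrogen count and charge) map naturally onto a regular expression. The pattern and the name `parse_bracket_atom` below are an illustrative sketch covering only the features described in this document, not a complete SMILES grammar.

```python
import re

BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"
    r"(?P<symbol>[A-Z][a-z]?|[a-z])"
    r"(?P<chiral>@{1,2})?"
    r"(?P<hcount>H\d?)?"
    r"(?P<charge>\++\d*|-+\d*)?\]"
)

def parse_bracket_atom(text):
    """Parse one bracketed atom such as '[OH3+]' into a dict of properties."""
    match = BRACKET_ATOM.fullmatch(text)
    if match is None:
        raise ValueError(f"not a supported bracket atom: {text!r}")
    hcount = match.group("hcount")
    hydrogens = 0 if hcount is None else int(hcount[1:] or 1)  # 'H' alone means one
    charge_text = match.group("charge")
    if charge_text is None:
        charge = 0
    else:
        digits = charge_text.lstrip("+-")
        magnitude = int(digits) if digits else len(charge_text)  # '+++' means +3
        charge = magnitude if charge_text[0] == "+" else -magnitude
    isotope = match.group("isotope")
    return {
        "isotope": int(isotope) if isotope else None,
        "symbol": match.group("symbol"),
        "chiral": match.group("chiral"),
        "hydrogens": hydrogens,
        "charge": charge,
    }

for example in ["[H+]", "[Fe+2]", "[Fe++]", "[OH-]", "[OH3+]", "[NH4+]"]:
    print(example, parse_bracket_atom(example))
```

Hydrogen count and charge default to zero inside brackets, matching the rule that anything unspecified is taken as zero.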

3.2.2 Bonds

CC ethane (CH3CH3)
C=O formaldehyde (CH2O)
C=C ethene (CH2=CH2)
O=C=O carbon dioxide (CO2)
COC dimethyl ether (CH3OCH3)
C#N hydrogen cyanide (HCN)
CCO ethanol (CH3CH2OH)
[H][H] molecular hydrogen (H2)

For linear structures, SMILES notation corresponds to conventional diagrammatic notation except that hydrogens and single bonds are generally omitted. For example, 6-hydroxy-1,4-hexadiene can be represented by many equally valid SMILES, including the following three:

Structure                    Valid SMILES
                             C=CCC=CCO
CH2=CH-CH2-CH=CH-CH2-OH      C=C-C-C=C-C-O
                             OCC=CCC=C

3.2.3 Branches

SMILES                   Name
CCN(CC)CC                Triethylamine
CC(C)C(=O)O              Isobutyric acid
C=CC(CCC)C(C(C)C)CCC     3-propyl-4-isopropyl-1-heptene

3.2.4 Cyclic Structures

There are usually many different, but equally valid descriptions of the same structure, e.g., the following SMILES notations for 1-methyl-3-bromo-cyclohexene-1:

Many other notations may be written for the same structure, deriving from different ring closures. SMILES does not have a preferred entry on input although (a) above may be simplest, others are just as valid.

A single atom may have more than one ring closure. This is illustrated by the structure of cubane in which two atoms have more than one ring closure:

Generation of SMILES for cubane: C12C3C4C1C5C4C3C25.

If desired, digits denoting ring closures can be reused. As an example, the digit 1 used twice in the specification:

The ability to re-use ring closure digits makes it possible to specify structures with 10 or more rings. Structures that require more than 10 ring closures to be open at once are exceedingly rare. If necessary or desired, higher-numbered ring closures may be specified by prefacing a two-digit number with a percent sign (%). For example, C2%13%24 is a carbon atom with ring closures 2, 13, and 24.
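Reading the ring closure labels attached to a single organic-subset atom, including the two-digit %nn form, can be sketched as follows. The function name `ring_labels` is hypothetical, and the sketch deliberately ignores bracket atoms and bond symbols.

```python
import re

def ring_labels(atom_with_closures):
    """Return the ring closure numbers written after one atom symbol."""
    # '%' introduces exactly two digits; any other digit is a one-digit label.
    pairs = re.findall(r"%(\d\d)|(\d)", atom_with_closures)
    return [int(two_digit or one_digit) for two_digit, one_digit in pairs]

print(ring_labels("C2%13%24"))
print(ring_labels("C12"))   # without '%', this is closures 1 and 2, not 12
```

The second call shows why the percent sign is needed: plain digits are always read one at a time.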

3.2.5 Disconnected Structures

Matching pairs of digits following atom specifications imply that the atoms are bonded to each other. The bond may be explicit (bond symbol and/or direction preceding the ring closure digit) or implicit (a nondirectional single or aromatic bond). This is true whether or not the bond ends up as part of a ring.

Adjacent atoms separated by a dot (.) are not bonded to each other. This is true whether or not the atoms are in the same connected component.

For example, C1.C1 specifies the same molecule as CC (ethane).

3.3 Isomeric SMILES

The SMILES isomer specification rules allow chirality to be completely specified for any structure, if it is known. Unlike most existing chemical nomenclatures such as CIP and IUPAC, these rules are also designed to allow rigorous partial specification of chirality. Aside from use in macros, substructure searching, and other pattern matching operations, this is important because much of the world's available chemical information is known for structures with incompletely resolved chiralities (not all possible chiral centers are separated, known, or reported).

All isomer specification rules in SMILES are therefore optional. The absence of a specification for any attribute implies that the value of that attribute is unspecified.

3.3.1 Isotopic Specification

Smiles Name
[12C] carbon-12
[13C] carbon-13
[C] carbon (unspecified mass)
[13CH4] C-13 methane

3.3.2 Configuration Around Double Bonds

An important difference between SMILES chirality conventions and others such as CIP is that SMILES uses local chirality representation (as opposed to absolute chirality), which allows partial specifications. An example of this is illustrated below:

F/C=C/C=C/C F/C=C/C=CC
(completely specified) (partially specified)

3.3.3. Configuration Around Tetrahedral Centers

The simplest and most common kind of chirality is tetrahedral: four neighbor atoms are evenly arranged about a central atom, known as the "chiral center". If all four neighbors are different from each other in any way, mirror images of the structure will not be identical. The two mirror images are known as "enantiomers" and are the only two forms that a tetrahedral center can have. If two (or more) of the four neighbors are identical to each other, the central atom will not be chiral (its mirror images can be superimposed in space).

In SMILES, tetrahedral centers may be indicated by a simplified chiral specification (@ or @@) written as an atomic property following the atomic symbol of the chiral atom. If a chiral specification is not present for a chiral atom, its chirality is implicitly not specified. For instance:

NC(C)(F)C(=O)O              N[C@](C)(F)C(=O)O
NC(F)(C)C(=O)O              N[C@@](F)(C)C(=O)O
(unspecified chirality)     (specified chirality)

Looking from the amino N to the chiral C (as the SMILES is written), the three other neighbors appear anticlockwise in the order that they are written in the top SMILES, N[C@](C)(F)C(=O)O (methyl-C, F, carboxy-C), and clockwise in the bottom one, N[C@@](F)(C)C(=O)O. The symbol "@" indicates that the following neighbors are listed anticlockwise (it is a "visual mnemonic" in that the symbol looks like an anticlockwise spiral around a central circle). "@@" indicates that the neighbors are listed clockwise (you guessed it, anti-anti-clockwise).

If the central carbon is not the very first atom in the SMILES and has an implicit hydrogen attached (it can have at most one and still be chiral), the implicit hydrogen is taken to be the first neighbor atom of the three neighbors that follow a tetrahedral specification. If the central carbon is first in the SMILES, the implicit hydrogen is taken to be the "from" atom. Hydrogens may always be written explicitly (as [H]) in which case they are treated like any other atom. In each case, the implied order is exactly as written in SMILES. Some of the valid SMILES for the alanine are:

N[C@@]([H])(C)C(=O)O        N[C@]([H])(C)C(=O)O
N[C@@H](C)C(=O)O            N[C@H](C)C(=O)O
N[C@H](C(=O)O)C             N[C@@H](C(=O)O)C
[H][C@](N)(C)C(=O)O         [H][C@@](N)(C)C(=O)O
[C@H](N)(C)C(=O)O           [C@@H](N)(C)C(=O)O
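Whether two tetrahedral specifications describe the same enantiomer can be checked with permutation parity: an even reordering of the four neighbors keeps the @/@@ mark, while an odd reordering flips it. The sketch below works on neighbor labels rather than on SMILES strings; the labels and function names are illustrative assumptions. The first neighbor in each list is the "from" atom (the implicit hydrogen when the chiral atom is written first).

```python
def is_even_permutation(source, target):
    """True if target is an even permutation of the distinct items in source."""
    if sorted(source) != sorted(target) or len(set(source)) != len(source):
        raise ValueError("both lists must contain the same distinct neighbors")
    position = {item: i for i, item in enumerate(source)}
    perm = [position[item] for item in target]
    seen = [False] * len(perm)
    transpositions = 0
    for start in range(len(perm)):
        length, j = 0, start
        while not seen[j]:           # walk one cycle of the permutation
            seen[j] = True
            j = perm[j]
            length += 1
        if length:
            transpositions += length - 1
    return transpositions % 2 == 0

def same_configuration(order_a, mark_a, order_b, mark_b):
    """Compare two specs given as (neighbor order as written, '@' or '@@')."""
    return (mark_a == mark_b) == is_even_permutation(order_a, order_b)

l_alanine = (["N", "H", "CH3", "COOH"], "@@")      # N, then H, methyl, carboxyl
reordered = (["N", "H", "COOH", "CH3"], "@")       # same atom, two neighbors swapped
hydrogen_first = (["H", "N", "CH3", "COOH"], "@")  # explicit H written first
d_alanine = (["N", "H", "CH3", "COOH"], "@")

print(same_configuration(*l_alanine, *reordered))
print(same_configuration(*l_alanine, *d_alanine))
```

This is why each column of alternative alanine spellings mixes @ and @@: every swap of two neighbors is compensated by switching the mark.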

The chiral order of the ring closure bond is implied by the lexical order that the ring closure digit appears on the chiral atom (not in the lexical order of the "substituent" atom).

C[C@H]1CCCCO1
or
O1CCCC[C@@H]1C

3.3.4 General Chiral Specification

The general chiral specification used in SMILES has three parts: the @ symbol, followed by a two-letter chiral class indicator, followed by a numerical chiral permutation designator. A default chiral class is assigned to each degree (number of connections); the default class for four connections is tetrahedral (TH). Most chiralities have more than two possible choices; the choices are assigned from a table numerically. In most cases, the @1 designation means "anticlockwise around the axis represented by SMILES order" and @2 means "clockwise". Notations in the form "@@" and "@@@" are interpreted as "@2" and "@3" (analogous to "+++" meaning "+3"). The "@" and "@@" notations used above are shortcuts for the full specifications "@TH1" and "@TH2". In practice, full chiral specifications are not often needed.

SMILES handles the full range of chiral specification, including resolution of "reduced chirality" (where the number of enantiomers is reduced by symmetry) and "degenerate chirality" (where the center becomes non-chiral due to symmetrical substitution). As with other aspects of SMILES, the language guarantees the ability to specify exactly what is known, including partial specifications. The SMILES system will generate unique isomeric SMILES for any given specification, and substructure recognition will operate correctly on all types of chirality.

The rest of this section will be limited to discussing the following chiralities: tetrahedral, allene-like, square-planar, trigonal-bipyramidal, and octahedral. Although many more chiral classes can be handled by this system (it's table-driven), these five classes are very common in chemistry and cover most of the issues to be encountered in the remainder.

Tetrahedral. The tetrahedral class symbol is TH. This is the default chiral class for degree four. Possible values are 1 and 2. @TH1 (or just @) indicates that, looking from the first connected atom, the following three connected atoms are listed anticlockwise; @TH2 (or @@) indicates clockwise.

Allene-like. The allene-like class symbol is AL. This is the default chiral class for degree 2 (the chiral center is the central atom with two double bonds). Although substituted C=C=C structures are most common, C=C=C=C=C structures are also allene-like, as are any odd number of serially double-bonded atoms. Possible values are @AL1 (or just @) and @AL2 (or @@); these are interpreted by superimposing the substituted atoms and evaluating as per tetrahedral. Hydrogens attached to substituted allene-like atoms are taken to be immediately following that atom, as shown below:

OC(Cl)=[C@]=C(C)F       OC=[C@]=CF
OC(Cl)=[C@AL1]=C(C)F    OC([H])=[C@]=C([H])F

Square-planar. The square-planar class symbol is SP. Possible values are @SP1, @SP2, and @SP3. This is not the default chiral class for degree four, so shorthand specifications are not allowed. Square-planar is also somewhat unusual in that the ideas of clockwise and anticlockwise do not apply.

F[Pt@SP1](Cl)(Br)I (SP1 lists in a "U shape")
F[Pt@SP2](Br)(Cl)I (SP2 lists in a "4-shape")
F[Pt@SP3](Cl)(I)Br (SP3 lists in a "Z shape")

Trigonal-bipyramidal. The trigonal-bipyramidal class symbol is TB. This is the default chiral class for degree five. Possible values are @TB1 to @TB20. @TB1 (or just @) indicates that, when the SMILES is listed from one axial connection to the other, the three intermediate, equatorially-connected atoms are listed anticlockwise; @TB2 (or @@) indicates clockwise. This is illustrated below.

Octahedral. The octahedral class symbol is OH. This is the default chiral class for degree six. Possible values are @OH1 to @OH30. @OH1 (or just @) indicates that, when the SMILES is listed from one axial connection to the other, the four intermediate, equatorially-connected atoms are listed anticlockwise; @OH2 (or @@) indicates clockwise. This is illustrated below.

3.4 SMILES Conventions

3.4.1 Hydrogens

Hydrogens may be specified in three ways:

  • Implicitly: for atoms specified without brackets, from normal valence assumptions.
  • Explicitly by count: inside brackets, by the hydrogen count supplied; zero if unspecified.
  • As explicit atoms: as [H] atoms.

There is no distinction between "organic" and "inorganic" SMILES nomenclature. One may specify the number of attached hydrogens for any atom in any SMILES. For example, propane may be entered as [CH3][CH2][CH3] instead of CCC.
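The implicit case can be sketched as a small lookup. The valence table below is the usual organic-subset convention (lowest normal valence consistent with the bonds already written) and is an assumption here, since this section does not list it; `NORMAL_VALENCES` and `implicit_h` are hypothetical names.

```python
# Assumed "normal" valences for atoms that may be written without brackets.
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,),
    "P": (3, 5), "S": (2, 4, 6),
    "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}


def implicit_h(symbol, bond_order_sum):
    """Hydrogens implied for an unbracketed atom with the given explicit bonding."""
    for valence in NORMAL_VALENCES[symbol]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    # All normal valences are already exceeded: no hydrogens are added.
    return 0


# Propane written as CCC: the end carbons have one bond, the middle one has two.
print([implicit_h("C", 1), implicit_h("C", 2), implicit_h("C", 1)])  # [3, 2, 3]
```

The result matches the bracketed spelling [CH3][CH2][CH3] given above.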

There are four situations where explicit hydrogen specification is required:

  • charged hydrogen, i.e., a proton, [H+];
  • hydrogens connected to other hydrogens, e.g., molecular hydrogen, [H][H];
  • hydrogens connected to other than one other atom, e.g., bridging hydrogens; and
  • isotopic hydrogen specifications, e.g., in heavy water, [2H]O[2H].

3.4.2 Aromaticity

The SMILES algorithm uses an extended version of Hückel's rule to identify aromatic molecules and ions. To qualify as aromatic, all atoms in the ring must be sp² hybridized and the number of available "excess" p-electrons must satisfy Hückel's 4N+2 criterion. As an example, benzene is written c1ccccc1, but an entry of C1=CC=CC=C1 (cyclohexatriene, the Kekulé form) leads to detection of aromaticity and results in an internal structural conversion to aromatic representation. Conversely, entries of c1ccc1 and c1ccccccc1 will produce the correct anti-aromatic structures for cyclobutadiene and cyclooctatetraene, C1=CC=C1 and C1=CC=CC=CC=C1. In such cases the SMILES system looks for a structure that preserves the implied sp² hybridization, the implied hydrogen count, and the specified formal charge, if any. Some inputs, however, may be not only incorrect but also impossible, such as c1cccc1. Here c1cccc1 cannot be converted to C1=CCC=C1, since one of the carbon atoms would be sp³ with two attached hydrogens. In such a structure alternating single and double bond assignments cannot be made. The SMILES system will flag this as an "impossible" input. Please note that only atoms on the following list can be considered aromatic: C, N, O, P, S, As, Se, and * (wildcard). In addition, exocyclic double bonds do not break aromaticity.

C1=COC=C1 C1=CN=C[NH]C(=O)1 C1=C*=CC=C1
c1cocc1 c1cnc[nH]c(=O)1 c1c*ccc1
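The 4N+2 electron count at the heart of this test is easy to state in code. The sketch below checks only that count; it does not verify hybridization or implement the extended rules the system uses, and the per-atom electron contributions in the examples are the usual textbook assumptions (one electron per sp² carbon, two from a furan-type oxygen). `satisfies_huckel` is a hypothetical name.

```python
def satisfies_huckel(pi_electrons):
    """True if the electron count has the form 4N + 2 for some integer N >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0


# Benzene, c1ccccc1: six carbons contributing one electron each.
print(satisfies_huckel(6))          # True
# Cyclobutadiene and cyclooctatetraene: 4 and 8 electrons.
print(satisfies_huckel(4))          # False
print(satisfies_huckel(8))          # False
# Furan, c1cocc1: four carbons (1 each) plus the ring oxygen (2).
print(satisfies_huckel(4 * 1 + 2))  # True
```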

It is important to remember that the SMILES aromaticity detection algorithm serves the purposes of chemical information representation only! To this end, rigorous rules are provided for determining the "aromaticity" of charged, heterocyclic, and electron-deficient ring systems. The "aromaticity" designation as used here is not intended to imply anything about the reactivity, magnetic resonance spectra, heat of formation, or odor of substances.

3.4.3 Aromatic Nitrogen Compounds

n1ccccc1           Pyridine
O=n1ccccc1         Pyridine-N-oxide
[O-][n+]1ccccc1    Pyridine-N-oxide (charge-separated form)
Cn1cccc1           Methylpyrrole
[nH]1cccc1         1H-pyrrole

Note that the pyrrolyl nitrogen in 1H-pyrrole is written [nH] to distinguish this kind of nitrogen from a pyridyl-N. Alternative valid SMILES for 1H-pyrrole include [H]n1cccc1 (with explicit hydrogen) and N1C=CC=C1 (aliphatic form); all three input forms are equivalent.

3.4.4 Bonding Conventions

Given one valence model of a structure, chemical database systems such as THOR and Merlin have the ability to retrieve data about that structure even if the data were stored under a different valence model of the structure. With such systems, the choice of valence conventions is not critical to either database design or database query.

3.4.5 Tautomers

O=c1[nH]cccc1    2-pyridone
Oc1ncccc1        2-pyridinol

3.5 Extensions for Reactions

The SMILES language is extended to handle reactions. There are two areas where SMILES is extended: distinguishing the component parts of a reaction, and atom maps.

Component parts of a reaction are handled by introducing the ">" character as a new separator. Any reaction must have exactly two > characters in it. ">>" is a valid reaction SMILES for an empty reaction. Each of the ">"-separated components of a reaction must be a valid molecule SMILES.

As an aside, molecule SMILES never have a ">" character. In a program, one can quickly determine if a SMILES refers to a reaction or molecule by searching for a ">" character in the string.
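The two rules above (a ">" marks a reaction, and a reaction has exactly two of them) can be sketched as follows. The function names are hypothetical, and splitting components on "." is a simplification that treats every dot as a component separator without validating each part as a molecule SMILES.

```python
def is_reaction(smiles):
    """A SMILES containing '>' refers to a reaction, otherwise to a molecule."""
    return ">" in smiles


def split_reaction(smiles):
    """Split a reaction SMILES into (reactants, agents, products) lists."""
    parts = smiles.split(">")
    if len(parts) != 3:
        raise ValueError("a reaction SMILES needs exactly two '>' characters")

    def components(part):
        # An empty field means no molecules on that side.
        return part.split(".") if part else []

    reactants, agents, products = (components(p) for p in parts)
    return reactants, agents, products


print(is_reaction("CCC"))                    # False
print(split_reaction(">>"))                  # ([], [], [])
print(split_reaction("C=CCBr>CC(=O)C>C=CCI"))
```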

Reaction SMILES Grammar:

For example:

C=CCBr>>C=CCI

This is a valid reaction. Note that there are no agent molecules. Also note that several atoms are missing from the reaction (the product "Br" and the reactant "I").

[I-].[Na+].C=CCBr>>[Na+].[Br-].C=CCI

This is a more complete version of the same reaction. It has been canonicalized. It would form the root of a datatree when stored in a THOR database.

C=CCBr.[Na+].[I-]>CC(=O)C>C=CCI.[Na+].[Br-]

This version of the reaction includes an agent. Note that the SMILES does not indicate how the agent participates. Whether the agent is a solvent, catalyst, or performs another function within the reaction must be stored separately as data. This SMILES could be stored in a THOR database as an absolute SMILES and would appear on the same datatree page as the previous example.

In the above example, note that the reaction is ambiguous with respect to the carbon atoms involved. One might assume that a normal Sn2 displacement is occurring. In fact, an equally reasonable allylic displacement is possible, for example via an Sn1-like allyl cation. Recognize that the reaction SMILES given above do not say which carbons are which and hence do not discriminate between the alternate mechanisms.

This case demonstrates the use and need for atom maps for reaction processing. Atom maps are used primarily to further define the overall reaction in cases where the reaction mechanism may not be evident from the reactant and product molecules. Atom maps are non-negative integer atom modifiers. They follow the ":" character within an atom expression. They must be the last modifier within the atom expression:

SMILES Atom Expression Grammar:

Atom maps are an atomic property. They can legally appear in a SMILES for any atom, whether or not it is part of a reaction. Atoms with atom map labels in a molecule SMILES are considered valid; the atom maps are ignored for molecule processing. Absolute and unique SMILES generated by the system for molecules never include atom maps.
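Because the map is the last modifier inside the brackets and follows a ":", ignoring it for molecule processing amounts to dropping that suffix. A minimal sketch, assuming a well-formed input string; `strip_atom_maps` is a hypothetical helper, and it leaves the brackets in place rather than simplifying atoms back to their unbracketed form.

```python
import re

# A map is ':' plus digits immediately before the closing bracket.
_ATOM_MAP = re.compile(r":\d+(?=\])")


def strip_atom_maps(smiles):
    """Remove atom map modifiers, leaving every other atom property untouched."""
    return _ATOM_MAP.sub("", smiles)


print(strip_atom_maps("[CH3:2][OH:1]"))  # [CH3][OH]
```

Anchoring the pattern on the closing bracket keeps it from touching a ":" used as an aromatic bond symbol between atoms.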

Finally, there are some differences in the handling of atom maps and agent components in the unique versus absolute SMILES for reactions. Atom maps and agent components are not part of the unique SMILES specification. This is important for the THOR database, where the datatree roots are formed from the unique SMILES. The net result is that each reaction datatree may contain multiple specific reactions with different agents and atom maps.

3.5.1 Reaction Atom Maps

Atom mappings are properties of the atoms in the reaction molecules. The mappings represent equivalence classes of atoms within a reaction. In effect, the map tells the computer which atoms are the same on the reactant and products sides of a reaction. Without this map information, it is difficult to derive the reaction bond changes which occur.

Within the SMILES language, atom maps are represented as a non-negative numeric atom modifier following the ":" character (e.g. [CH3:2] is a carbon in class 2).

Within the Daylight toolkit, the atom maps are manipulated as sets of mapped atoms. The atom map class numbers which are used in SMILES do not appear in the toolkit interface to a reaction. The map class numbers in SMILES do not have any additional significance, except to associate all atoms with the same map class label to one another.

There are no requirements for completeness or uniqueness of the atom mappings. Atom mappings are independent of the connectivity and properties of the underlying molecules. This is so for several reasons: first, there are limits to the valence representation of molecules which appear when processing reactions. For example the oxygens in sodium acetate (CC(=O)[O-].[Na+]) are chemically indistinguishable, even though the valence model used in the toolkit requires that they be connected differently. Some systems (CAS, for example) recognize this equivalence in their structural representation (the tautomer bond). It is often useful to map these to the same class for reaction purposes: [CH3:1][C:2](=[O:3])[O-:3].[Na+:4]
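Since maps are equivalence classes rather than unique labels, a natural way to read them is to group atoms by class. The sketch below does this with a regular expression over bracket atoms; `atoms_by_map_class` is a hypothetical helper and assumes a well-formed SMILES.

```python
import re
from collections import defaultdict

# Bracket atom with a map: capture the atom text and the class number.
_MAPPED_ATOM = re.compile(r"\[([^\]:]+):(\d+)\]")


def atoms_by_map_class(smiles):
    """Group the bracket-atom texts of a SMILES by their atom map class."""
    classes = defaultdict(list)
    for atom, map_class in _MAPPED_ATOM.findall(smiles):
        classes[int(map_class)].append(atom)
    return dict(classes)


print(atoms_by_map_class("[CH3:1][C:2](=[O:3])[O-:3].[Na+:4]"))
# {1: ['CH3'], 2: ['C'], 3: ['O', 'O-'], 4: ['Na+']}
```

Class 3 holds both acetate oxygens, which is exactly the "chemically indistinguishable" grouping described above.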

A second case is where there is ambiguity in a reaction mechanism which one wants to express:

The reactant in question (its structure is not reproduced here) can undergo a Cope rearrangement before reaction (which yields the same molecule graph). In effect, there are two distinct mechanisms by which the product is produced. This can be expressed as part of a reaction by: [CH2:1]=[CH:2][CH2:1][CH2:3][C:4](C)[CH2:3]

A third case is simply a lack of information about the reaction itself. It should be possible to omit some atom maps or specify partial information for sets of atoms which *might* end up in a given position in the product. It is never acceptable to force a user to make up data in order to register a reaction. One should only store exactly what is known about the reaction. Atom maps are, by definition, ambiguous with respect to the underlying molecules. Atom maps do not appear in the lexical representation of a unique SMILES. They do appear in the lexical representation of an absolute SMILES.

Finally, atom maps are arbitrary class designations; the values of the numbers have no meaning. The Daylight system reserves the right to change the class numbers upon canonicalization of a reaction. The system will reorder the atom map classes over the entire reaction during canonicalization. The resulting maps are guaranteed to have the same meaning as the reaction before canonicalization. Practically, the maps are renumbered as small, dense integers in canonical atom order, but this is not guaranteed. Also, during canonicalization, the atom map classes for agent atoms are removed.
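The renumbering idea can be illustrated without a canonicalizer. The sketch below relabels classes as 1, 2, 3, ... in order of first appearance in the string, which is only a stand-in for canonical atom order, and drops maps on agent atoms. `renumber_maps` is a hypothetical helper and the example reaction string is made up for illustration.

```python
import re

_ATOM_MAP = re.compile(r":(\d+)(?=\])")


def renumber_maps(reaction):
    """Relabel map classes densely (first-appearance order) and unmap agents."""
    reactants, agents, products = reaction.split(">")
    new_label = {}

    def relabel(match):
        old = match.group(1)
        if old not in new_label:
            new_label[old] = len(new_label) + 1
        return f":{new_label[old]}"

    # The same old class always maps to the same new class, so meaning is kept.
    reactants = _ATOM_MAP.sub(relabel, reactants)
    products = _ATOM_MAP.sub(relabel, products)
    agents = _ATOM_MAP.sub("", agents)
    return ">".join((reactants, agents, products))


print(renumber_maps("[CH3:7][Br:3]>[OH2:9]>[CH3:7][I:5]"))
# [CH3:1][Br:2]>[OH2]>[CH3:1][I:3]
```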

3.5.2 Hydrogens

Hydrogens in reactions are handled as with molecules; they are suppressed unless "special". Recall that for molecules, hydrogens are special if they are charged, isotopic, bonded to another hydrogen, or multiply bonded. With reactions, there is an additional case which will make a hydrogen special. It is often desirable (e.g., for a 1,5-hydride shift) to store information about the location of hydrogens as part of the atom map of a reaction. Hydrogens with a supplied atom map are considered "special" and these hydrogens are not suppressed. These mapped hydrogens appear explicitly in Absolute SMILES for reactions. Otherwise, atom-mapped hydrogens do not appear in Unique SMILES.
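These suppression rules can be written as a single predicate. The sketch below uses a made-up, minimal description of a hydrogen atom (charge, isotope, neighbor symbols, atom map); the function and parameter names are hypothetical and do not correspond to any toolkit interface.

```python
def is_special_hydrogen(charge=0, isotope=None, neighbors=("C",),
                        atom_map=None, in_reaction=False):
    """True if a hydrogen must be kept explicit instead of being suppressed."""
    if charge != 0:              # e.g. a proton, [H+]
        return True
    if isotope is not None:      # e.g. deuterium, [2H]
        return True
    if len(neighbors) != 1:      # bridging (or unattached) hydrogen
        return True
    if "H" in neighbors:         # molecular hydrogen, [H][H]
        return True
    # The additional reaction-only case: a hydrogen that carries an atom map.
    return in_reaction and atom_map is not None


print(is_special_hydrogen())                              # False
print(is_special_hydrogen(atom_map=6))                    # False (molecule)
print(is_special_hydrogen(atom_map=6, in_reaction=True))  # True
```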


Extensions

SMARTS is a line notation for specifying substructural patterns in molecules. While it uses many of the same symbols as SMILES, it also allows specification of wildcard atoms and bonds, which can be used to define substructural queries for chemical database searching. One common misconception is that SMARTS-based substructural searching involves matching of SMILES and SMARTS strings. In fact, both SMILES and SMARTS strings are first converted to internal graph representations, which are then searched for subgraph isomorphism.

SMIRKS, a superset of "reaction SMILES" and a subset of "reaction SMARTS", is a line notation for specifying reaction transforms. The general syntax for the reaction extensions is REACTANT>AGENT>PRODUCT (without spaces), where any of the fields can either be left blank or filled with multiple molecules delimited with a dot (.), and other descriptions dependent on the base language. Atoms can additionally be identified with a number (e.g. [C:1]) for mapping, for example in [CH2:1]=[CH:2][CH:3]=[CH:4][CH2:5][H:6]>>[H:6][CH2:1][CH:2]=[CH:3][CH:4]=[CH2:5].
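With every atom mapped, the bond changes of a reaction can be read off by comparing bonds between map classes on the two sides. The sketch below does this for the mapped example above. It is deliberately narrow: it handles only unbranched, ring-free chains in which every atom is a mapped bracket atom, and `chain_bonds` and `bond_changes` are hypothetical names.

```python
import re

# Either a mapped bracket atom (capturing its class) or a bond symbol.
_TOKEN = re.compile(r"\[[^\]]*:(\d+)\]|([=#])")


def chain_bonds(smiles):
    """Bonds of a fully mapped, unbranched chain as {frozenset({a, b}): order}."""
    bonds = {}
    previous, order = None, 1
    for match in _TOKEN.finditer(smiles):
        if match.group(2):
            order = {"=": 2, "#": 3}[match.group(2)]
            continue
        current = int(match.group(1))
        if previous is not None:
            bonds[frozenset((previous, current))] = order
        previous, order = current, 1
    return bonds


def bond_changes(reaction):
    """Map (class_a, class_b) -> (order before, order after) for changed bonds."""
    reactant, _agent, product = reaction.split(">")
    before, after = chain_bonds(reactant), chain_bonds(product)
    changes = {}
    for pair in before.keys() | after.keys():
        old, new = before.get(pair, 0), after.get(pair, 0)
        if old != new:
            changes[tuple(sorted(pair))] = (old, new)
    return changes


shift = ("[CH2:1]=[CH:2][CH:3]=[CH:4][CH2:5][H:6]"
         ">>[H:6][CH2:1][CH:2]=[CH:3][CH:4]=[CH2:5]")
for pair, (old, new) in sorted(bond_changes(shift).items()):
    print(pair, old, "->", new)
```

For this example the hydrogen in class 6 loses its bond to class 5 and gains one to class 1, while the double bonds move along the chain.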

