PRABI
Rhone-Alpes Bioinformatics Center


PBIL tutorials

Quick Search Tutorial

Searching for Sequences

Quick Search provides a fast way to find sequences or sequence families in the databases available on the PBIL server. It is an alternative to WWW Query, which allows more complex queries. Quick Search retrieves the sequences or sequence families associated with a single word without requiring you to specify what kind of word it is: you can enter a keyword, a sequence name, an accession number, or a taxon name.

Use Quick Search to retrieve, in a single operation, the sequences (or sequence families) associated with a word. This word can be a keyword, a species, a sequence name or an accession number.

Once you enter a word, the selected database is queried to find sequences:

* with an accession number identical to this word

* with a name beginning with this word

* associated with a keyword matching this word

* from a species or a taxon matching this word.

A list of sequences is generated for each of these queries. The lists are named AC, name, keyword and species respectively, and each non-empty list is accessible at the top of the results page. The lists may share sequences, one list may be contained in another, or the lists may be totally independent. Finally, all the lists are merged into a global list named all, which is displayed on the results page.
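The four queries and the merged list can be pictured with a small toy model. This is only a sketch: the records, the accession numbers and the function name quick_search are made up for illustration, and treating a "match" as a case-insensitive substring test is an assumption, not a description of how the PBIL server is implemented.

```python
# Toy model of the four Quick Search lists and the merged "all" list.
# Records, accession numbers and field names are invented for illustration.
RECORDS = [
    {"name": "BTG1_HUMAN", "ac": "A00001",
     "keywords": ["btg1 protein"], "species": "homo sapiens"},
    {"name": "BTG100", "ac": "A00002",
     "keywords": ["hypothetical protein"], "species": "mus musculus"},
    {"name": "CNO7_HUMAN", "ac": "A00003",
     "keywords": ["btg1 binding factor 1"], "species": "homo sapiens"},
]


def quick_search(word, records):
    """Return the AC, name, keyword and species lists plus their union."""
    w = word.lower()  # queries are not case sensitive
    lists = {
        # accession number identical to the word
        "AC": {r["name"] for r in records if r["ac"].lower() == w},
        # sequence name beginning with the word
        "name": {r["name"] for r in records
                 if r["name"].lower().startswith(w)},
        # keyword matching the word (assumed: substring test)
        "keyword": {r["name"] for r in records
                    if any(w in k.lower() for k in r["keywords"])},
        # species or taxon matching the word (assumed: substring test)
        "species": {r["name"] for r in records
                    if w in r["species"].lower()},
    }
    # the global list is the union of the four lists
    lists["all"] = set().union(*lists.values())
    return lists


result = quick_search("BTG1", RECORDS)
for label, names in result.items():
    print(label, sorted(names))
```

In this toy data set the "name" and "keyword" lists share one sequence, so the merged list holds three sequences rather than four.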

Optionally, you can search for an exact match instead of a loose match: if you are looking for the keyword "polymerase", a strict match selects only sequences associated with "polymerase", whereas a loose match also selects sequences associated with "rna polymerase", "poly(a)polymerase", etc.
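The difference between the two modes can be sketched as follows. The function keyword_matches is hypothetical, and modelling a loose match as a substring test is an assumption made for the example.

```python
def keyword_matches(word, keyword, exact=False):
    """Compare a query word with a keyword, ignoring case.

    exact=True  -> the keyword must be identical to the word
    exact=False -> the word only has to occur inside the keyword (assumed)
    """
    w, k = word.lower(), keyword.lower()
    return k == w if exact else w in k


keywords = ["polymerase", "rna polymerase", "poly(a)polymerase",
            "insulin receptor"]

loose = [k for k in keywords if keyword_matches("polymerase", k)]
strict = [k for k in keywords if keyword_matches("polymerase", k, exact=True)]

print(loose)
print(strict)
```

The same logic explains why an exact search for "insulin" does not return sequences whose only related keyword is "insulin receptor".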

Notes

- database queries are not case sensitive

- the maximum length for species and keywords is 40 characters

- because the number of sequences in the databases is continuously increasing, the results you obtain may differ from the results given in the examples

The following examples may help you understand the types of requests you can make and the results you get, and avoid errors, particularly when using the strict match option and when choosing among the different lists you get.

Remember that this tool is designed for a quick search, which is not always extremely precise. You may miss interesting sequences or select uninteresting ones if you use Quick Search carelessly. The tool is especially useful for a quick pre-selection of sequences before human analysis. If you want to make more specific queries, you will have to use WWW Query.

Example 1.1 : retrieve protein sequences from mus musculus

Enter mus musculus in the field, choose SwissProt as database and press "Go".

You obtain more than 45 000 sequences. At the top of the page, there are links to the lists "species" and "keyword". The first list, called "species", contains the sequences from the species or taxa matching mus musculus. The second list, called "keyword", contains the sequences with a keyword matching mus musculus. As expected, the "species" list contains more sequences than the "keyword" list. The global merged list contains more sequences than the "species" list, which means that some sequences associated with the keyword "mus musculus" are not sequences from mus musculus.

If you are interested only in the sequences from mus musculus, use the "species" list instead of the global list.

Example 2.1 : retrieve protein sequences associated with the keyword "insulin"

Enter insulin in the field, choose Hovergen prot as database and press "Go".

You obtain about 1000 sequences in the "keyword" list, each associated with a keyword matching "insulin", such as "insulin receptor", "insulin family" or "insulin-like growth factor".



1. AAAT_MOUSE  Neutral amino acid transporter B (Insulin-activated amino acid
          transporter) (ASC-like Na(+) dependent neutral amino acid transporter
          ASCT2).

2. ALS_HUMAN  Insulin-like growth factor binding protein complex acid labile chain
          precursor (ALS).

3. ALS_MOUSE  Insulin-like growth factor binding protein complex acid labile chain
          precursor (ALS).

4. ALS_PAPHA  Insulin-like growth factor binding protein complex acid labile chain
          precursor (ALS).

5. ALS_RAT  Insulin-like growth factor binding protein complex acid labile chain
          precursor (ALS).

6. BXA1_BOMMO  Bombyxin A-1 precursor (BBX-A1) (4K-prothoracicotropic hormone) (4K-
          PTTH).

7. BXA1_SAMCY  Bombyxin A-1 homolog precursor.

8. BXA2_BOMMO  Bombyxin A-2 precursor (BBX-A2) (4K-prothoracicotropic hormone) (4K-
          PTTH).

...
...

Example 2.2 : retrieve protein sequences strictly associated with the keyword "insulin"

As in the previous example, enter insulin in the field, choose SwissProt as database, but this time check the "exact match" box and press "Go".


1. INS_ACIGU  Insulin.

2. INS_ACOCA  Insulin.

3. INS_ALLMI  Insulin.

4. INS_AMICA  Insulin.

5. INS_ANAPL  Insulin precursor.

6. INS_ANGAN  Insulin precursor (Fragment).

7. INS_ANGRO  Insulin.

8. INS_ANSAN  Insulin.

9. INS_AOTTR  Insulin precursor.

10. INS_BALBO  Insulin.

11. INS_BALPH  Insulin.

12. INS_BOVIN  Insulin precursor.

...
...

You obtain fewer than 100 sequences, all associated with the keyword "insulin". This is far fewer than in the previous example, because a sequence that carries the keyword "insulin receptor" does not necessarily carry the keyword "insulin".

Example 2.3 : retrieve protein sequences strictly associated with the keyword "insulin receptor"

Enter insulin receptor in the field, choose SwissProt as database, check the "exact match" box and press "Go".

You obtain about 30 sequences associated with the keyword "insulin receptor". These sequences are different from the ones you obtained in example 2.2 by searching for an exact match with "insulin".

Example 2.4 : retrieve protein sequences associated with a keyword matching "insulin receptor"

Enter insulin receptor in the field, choose SwissProt as database, and this time deselect the "exact match" box and press "Go".

You obtain about 100 sequences associated with a keyword matching "insulin receptor", such as "insulin receptor binding protein" or "insulin receptor homolog".

Example 3.1 : retrieve EMBL sequences associated with the gene "BTG1"

Enter BTG1 in the field, choose EMBL as database, select the "exact match" box and press "Go".

You obtain 15 sequences associated with the keyword "BTG1". Indeed, a gene name is treated as a keyword in the ACNUC system. You can check this in the sequence annotations of the CDS BC006834.BTG1:


 
BC006834.BTG1        Location/Qualifiers
FT   CDS_pept        299..814
FT                   /codon_start=1
FT                   /db_xref="GOA:P31607"
FT                   /db_xref="UniProt/Swiss-Prot:P62325"
FT                   /gene="Btg1"
FT                   /product="B-cell translocation gene 1, anti-proliferative"
FT                   /protein_id="AAH06834.1"
FT                   /translation="MHPFYTRAATMIGEIAAAVSFISKFLRTKGLTSERQLQTFSQSLQ
FT                   ELLAEHYKHHWFPEKPCKGSGYRCIRINHKMDPLIGQAAQRIGLSSQELFRLLPSELTL
FT                   WVDPYEVSYRIGEDGSICVLYEASPAGGSTQNSTNVQMVDSRISCKEELLLGRTSPSKN
FT                   YNMMTVSG"
...
...
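If you save such annotations to a file, the gene name can be pulled out of the feature table with a few lines of code. The helper below is a hypothetical sketch, not part of Quick Search; it only assumes the /qualifier="value" layout visible in the listing above.

```python
import re

# A few feature-table lines copied from the annotation shown above.
FT_LINES = """\
FT   CDS_pept        299..814
FT                   /codon_start=1
FT                   /gene="Btg1"
FT                   /product="B-cell translocation gene 1, anti-proliferative"
"""


def qualifier(name, text):
    """Return the value of the first /name="value" qualifier, or None."""
    match = re.search(rf'/{re.escape(name)}="([^"]*)"', text)
    return match.group(1) if match else None


gene = qualifier("gene", FT_LINES)
print(gene)
# Queries are not case sensitive, so "Btg1" satisfies a search for "BTG1".
print(gene.lower() == "BTG1".lower())
```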

Example 3.2 : retrieve EMBL sequences associated with the word "BTG1"

Enter BTG1 in the field, choose EMBL as database, deselect the "exact match" box and press "Go".

You obtain 45 sequences. At the top of the page, there are links to the lists "name" and "keyword". The first list, called "name", contains the sequences with a name beginning with btg1. The second list, called "keyword", contains the sequences with a keyword matching btg1. The "name" list contains 20 sequences, and the "keyword" list contains 25 sequences.

In this example the "keyword" list contains more sequences (25) than the list you obtained in example 3.1 (15). This means that some keywords match "BTG1" in addition to the exact keyword "BTG1": the loose match adds 10 sequences. For example, some of these 10 sequences are associated with the gene "XBTG1" or the keyword "putative BTG1 binding factor 1". Are they really related to what you are looking for?

The "name" list contains sequences with a name beginning with BTG1, such as BTG100, BTG101, BTG102. Are they really related to what you are looking for?

Note that 20 + 25 = 45, which is the number of sequences in the global merged list. This means that no sequence is present in both "name" and "keyword". This is very important, because it means that none of the sequences in "name" has anything to do with the keyword "btg1". So if you are looking for sequences associated with the btg1 gene, it would be a mistake to select the whole global merged list. Note that some of the sequences in "keyword" have names such as BC006834.BTG1 but are not found in the "name" list. This is because, for speed, only sequence names beginning with the word are selected, i.e. BTG100 is selected but not BC006834.BTG1. This choice was made because searching for matches at the end of sequence names is very slow, and this type of query is very rare: usually you search for a sequence by the beginning of its name, especially for protein sequences.
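Both points of this paragraph can be checked with a short sketch. The function names are hypothetical; the overlap count is the usual relation |A and B| = |A| + |B| - |A or B|, and the prefix test mirrors the rule that only names beginning with the word are selected.

```python
def overlap_size(size_a, size_b, size_merged):
    """Number of sequences shared by two lists, given the size of their union."""
    return size_a + size_b - size_merged


def name_prefix_match(word, name):
    """True if the sequence name begins with the word (case-insensitive)."""
    return name.lower().startswith(word.lower())


# "name" has 20 sequences, "keyword" has 25, the merged list has 45.
print(overlap_size(20, 25, 45))  # 0: the two lists share no sequence

for name in ["BTG100", "BC006834.BTG1"]:
    print(name, name_prefix_match("btg1", name))
```

A result of 0 for the overlap is a warning sign: it shows that the "name" list was built from a different kind of match than the "keyword" list.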

Example 3.3 : retrieve protein sequences associated with "BTG1"

Enter BTG1 in the field, choose SwissProt as database, select the "exact match" box and press "Go".

You obtain 0 sequences!

Enter BTG1 in the field, choose SwissProt as database, deselect the "exact match" box and press "Go".

You obtain 8 sequences. At the top of the page, there are links to the lists "name" and "keyword". The first list, called "name", contains the sequences with a name beginning with btg1. The second list, called "keyword", contains the sequences with a keyword matching btg1. The "name" list contains 5 sequences, and the "keyword" list contains 8 sequences.

The global merged list contains 8 sequences as well: the "keyword" list and the merged list are identical. This means that all the sequences present in "name" are also present in "keyword", unlike in example 3.2.


1. BTG1_BOVIN  BTG1 protein (B-cell translocation gene 1 protein) (Myocardial
          vascular inhibition factor) (VIF).

2. BTG1_CHICK  BTG1 protein (B-cell translocation gene 1 protein).

3. BTG1_HUMAN  BTG1 protein (B-cell translocation gene 1 protein).

4. BTG1_MOUSE  BTG1 protein (B-cell translocation gene 1 protein).

5. BTG1_RAT  BTG1 protein (Anti-proliferative factor).

6. CNO7_HUMAN  CCR4-NOT transcription complex subunit 7 (CCR4-associated factor 1)
          (CAF1) (BTG1 binding factor 1).

7. Q9PVQ0  Xbtg1 (Btg1-prov protein).

8. Q9S9P2  T24D18.2 protein (Putative BTG1 binding factor 1).

The first 5 sequences are clearly related to BTG1. They are associated with the keyword "btg1 protein" and their names begin with "btg1". These 5 sequences are the ones found in the "name" list. The last 3 sequences are associated with the keywords "BTG1 binding factor 1", "Btg1-prov protein" and "Putative BTG1 binding factor 1" respectively.
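The relationship between the lists in this example can be verified with set operations, using the sequence names from the listing above. The variable names are chosen for this sketch only.

```python
# The 5 sequences whose names begin with "btg1".
name_list = {"BTG1_BOVIN", "BTG1_CHICK", "BTG1_HUMAN", "BTG1_MOUSE", "BTG1_RAT"}

# The 8 sequences with a keyword matching "btg1".
keyword_list = name_list | {"CNO7_HUMAN", "Q9PVQ0", "Q9S9P2"}

merged = name_list | keyword_list

print(name_list <= keyword_list)          # "name" is contained in "keyword"
print(merged == keyword_list)             # so the merged list equals "keyword"
print(sorted(keyword_list - name_list))   # sequences found by keyword only
```

The last line isolates the three sequences that deserve a closer look before you keep them.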

Example 3.4 : retrieve protein sequences associated with "BTG1 protein"

Enter BTG1 protein in the field, choose SwissProt as database, select the "exact match" box and press "Go".

You obtain 5 sequences in the "keyword" list, which are in fact the 5 sequences you obtained in the "name" list in example 3.3.

Enter BTG1 protein in the field, choose SwissProt as database, deselect the "exact match" box and press "Go".

Once again you obtain 5 sequences in the "keyword" list.

Example 3.5 : retrieve protein sequences associated with "BTG"

Enter BTG in the field, choose SwissProt as database, deselect the "exact match" box and press "Go".

You obtain 33 sequences. At the top of the page, there are links to the lists "name", "species" and "keyword". The "name" list contains 12 sequences, the "species" list contains 1 sequence and the "keyword" list contains 32 sequences.

The global merged list contains 33 sequences.

As in example 3.3, the 12 sequences of "name" appear to be included in "keyword", so the 32 sequences of "keyword" plus the 1 sequence from "species" give the 33 sequences of the global list.

The 12 sequences of "name" are associated with the keywords "BTG1 protein", "BTG2 protein", "BTG3 protein" and "BTG4 protein". Among the 32 sequences of "keyword", some are associated with "BTGa protein", "BTGb protein", "BTG-26", "LYSINE-N-OXYGENASE MBTG", etc. and are not closely related to the 12 sequences found in "name".

If you click on "species", you find the sequence:


Q9F4D7  RepA1 protein.

If you check the sequence annotation you get :


ID   Q9F4D7      PRELIMINARY;      PRT;   278 AA.
AC   Q9F4D7;
DT   01-MAR-2001 (TrEMBLrel. 16, Created)
DT   01-MAR-2001 (TrEMBLrel. 16, Last sequence update)
DT   01-JUN-2003 (TrEMBLrel. 24, Last annotation update)
DE   RepA1 protein.
GN   Name=repA1;
OS   Buchnera aphidicola.
OG   Plasmid pleu-BTg.
OC   Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
OC   Enterobacteriaceae; Buchnera.
OX   NCBI_TaxID=9;
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=Tuberolachnus salignus;
RX   MEDLINE=20461460; PubMed=10984505; DOI=10.1073/pnas.180310197;
RA   Van Ham R.C.H.J., Gonzalez-Candelas F., Silva F.J., Sabater B.,
RA   Moya A., Latorre A.;
RT   "Post-symbiotic plasmid acquisition and evolution of the repA1-
RT   replicon in Buchnera aphidicola.";
RL   Proc. Natl. Acad. Sci. U.S.A. 97:10855-10860(2000).
RN   [2]
...
...

The OG field gives the organelle associated with the sequence, which is:


Plasmid pleu-BTg.

It is clear that this sequence is not related to any BTG protein: it was selected only because of the organelle name.
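When a hit looks surprising, it helps to find out which annotation field actually contains the word. The sketch below does this for a few lines copied from the entry above; the function fields_containing is hypothetical, and it assumes the flat-file layout in which a two-letter line code is followed by three spaces.

```python
# A few lines copied from the Q9F4D7 entry shown above.
ENTRY = """\
ID   Q9F4D7      PRELIMINARY;      PRT;   278 AA.
DE   RepA1 protein.
GN   Name=repA1;
OS   Buchnera aphidicola.
OG   Plasmid pleu-BTg.
OC   Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
OC   Enterobacteriaceae; Buchnera.
"""


def fields_containing(word, entry):
    """Return the line codes whose text contains the word (case-insensitive)."""
    w = word.lower()
    hits = []
    for line in entry.splitlines():
        code, text = line[:2], line[5:]  # two-letter code, then the content
        if w in text.lower() and code not in hits:
            hits.append(code)
    return hits


print(fields_containing("btg", ENTRY))
```

Here only the OG line contains "btg", which confirms that the match comes from the plasmid name rather than from the protein description.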
