A Space & astronomy forum. SpaceBanter.com


[fitsbits] 'Dataset Identifications' postings (digest)



 
 
  #1  
Old March 23rd 04, 08:47 AM
Lucio Chiappetti via
external usenet poster
 
Posts: n/a
Default [fitsbits] 'Dataset Identifications' postings (digest)

On 16 Mar 2004, Don Wells wrote a digest :

From: Thomas McGlynn


There is an effort underway at several of the NASA archives to provide
a standard dataset identifier for data that can be retrieved from the
archives. The initial motivation is that when authors publish [...]


motivation understood and agreed

The keyword 'DS_IDENT' has been suggested. Does anyone have objections
to this or do they know of systems that already use this keyword?


I believe this or any other unused name is fine

------------------------------------------------------------------------
From: (Rob Seaman)


NOAO (through "Save the bits") has three or four million discrete FITS
images packaged up into MEF files for purposes of efficient and easy
handling. On the other hand, HEASARC's usage supplies an example
involving one dataset that contains several files.


would the former be "one file 'originally from' many datasets but now
actually a new dataset on its own" ? While the latter seems more
familiar to me. But I can imagine another case, i.e. data retrieved
from a site with a database and containing part of a catalog.

Personally, I think before we reserve "DS_IDENT" or any other keyword
for the purpose of identifying datasets, we should define the concept
of a "dataset".


Yes I think so.

Let me say what is my *understanding* of a "dataset" (which does not mean
it's something I propose as THE definition !) based on some past
experiences.

In the case of an X-ray satellite, typically one has a unit like an
observing proposal [A], which includes one or more pointings. The pointing
[B] occurs in a given time interval, and may involve SIMULTANEOUS
observations by more than one instrument [C]. For each instrument the
overall time may be divided into consecutive time intervals [D] in which a
given instrument configuration is used. There may be many different
telemetry packet streams generated during each interval [D], roughly
speaking many different files ... not even FITS files.

At some stage they might be transformed into a group of many different
(FITS?) files, which will be kept together as a dataset.

Just to give some examples, for the long forlorn Exosat satellite, the
observer received a half-inch tape called a FOT. There was one
logical FOT (maybe spanning several volumes) for each [A][B] combination,
where [B] was called the Observing Period (OP) and [D] were called
"observations". There were many (non-FITS) files for each [C][D]
combination, but I would call the FOT itself "the dataset". I don't
remember if they originally had an identifier other than the name of the
target and the date. I heard that ESTEC much later had plans to finally
re-archive as FITS event lists, however I haven't followed this.

For BeppoSAX, I'm the culprit for having forced inheritance of the above
naming, with [A][B] being the OP, and [D] observations. BeppoSAX had
FOTs (in the form of DAT cassettes with several non-FITS files) and they
were identified by the OP (sequential) number. A dataset was definitely
"the OP" or "the associated FOT". I would say more "the OP" as ASDC has
been archiving for online access also some reprocessed FITS event lists,
grouped by OP.

For XMM-Newton the naming is different but the concept is similar.
Proposals [A] have a numeric prop-id. [B] are called here "observations"
and have a 4-digit obs-id. [D] are called "exposures". What they used to
give to observers until a while ago was a CD associated to the combination
[A][B] ... and in fact the data were labelled with the concatenation of
prop-id and obs-id e.g. 0065760201. Now they distribute data online only,
but the scheme has been retained. "The dataset" is the ensemble of all
(many!) (FITS) files pertaining to an [A][B]. I note incidentally that,
although no tapes are used, the "flat" naming scheme is still used with
long horrible file names like P0065760201M1S001EBLSLI0000.FIT.
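
A minimal sketch of how the identifier buried in such a flat name could be
recovered and split. It assumes, from the single example above and not from
any XMM documentation, that the first run of exactly ten digits in the name
is the concatenated identifier and that its last four digits are the obs-id.
Both function names are hypothetical.

```python
import re


def dataset_id_from_filename(name: str) -> str:
    """Return the first run of exactly ten digits in a flat file name."""
    # Lookarounds make sure the run is not part of a longer digit string.
    match = re.search(r"(?<!\d)\d{10}(?!\d)", name)
    if match is None:
        raise ValueError(f"no 10-digit identifier in {name!r}")
    return match.group(0)


def split_dataset_id(dataset_id: str) -> tuple[str, str]:
    """Split a 10-digit identifier into (prop_id, obs_id).

    Assumption: the obs-id is the trailing 4 digits, so the prop-id is
    whatever precedes it.
    """
    if not re.fullmatch(r"\d{10}", dataset_id):
        raise ValueError(f"expected 10 digits, got {dataset_id!r}")
    return dataset_id[:6], dataset_id[6:]


if __name__ == "__main__":
    ident = dataset_id_from_filename("P0065760201M1S001EBLSLI0000.FIT")
    print(ident, split_dataset_id(ident))
```

With a tree layout the same two values would simply be directory names, so
no parsing of this kind would be needed.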

My personal tendency (but I'm an end user and not an archive maintainer in
this context) would have been to put part of the information in directory
names and not in file names (e.g. for my own BeppoSAX analysis I used
to store files as [A']/[B']/[C]/[D].type, and I tend to use shorter names
also for my own XMM reduction), while "the dataset" as distributed by ESA
contains instead only two directories, one with the semi-raw FITS
reformatted data, and the other one with the pipeline products.

But that (flat or tree) arrangement leaves unchanged the definition of
which files constitute "a dataset".

To go back to another old (but simple) example, in the case of the UV
satellite IUE, nobody cared about the proposal id [A] or the object [B]
when referring to a dataset. The "unit" was one exposure (one spectrum
with a given camera = only one camera operative at any time), or
"image", which had identifiers [C][D], e.g. SWP11056. The data delivered
to the observer was a set of 4-5 files (originally non-FITS) for each
"image" (one raw image, and the steps and results of a pipeline). In
this case I would be inclined to consider this group of files as "the
dataset" (irrespective of the fact that more than one, unrelated, could
be placed on a tape).

I'm not terribly familiar with the way a ground site like ESO manages its
archives, but definitely a proposal [A] can refer to many targets [B], and
ultimately to units called "OBs" (Observing Blocks) which are split into
exposures. Exposures taken at different times may be associated (e.g. for
a multi-object spectrograph one can associate the exposure taken with a
given mask with the dark or lamp calibration taken later with the same
mask), so it's this association I'd call "the dataset".

In any case, I've been talking so far of raw, semi-raw or standard-reduced
data archived at the original observatory (or other site in charge of
archiving) pertaining to a pointing of an object at a given time.

More to come below ...

------------------------------------------------------------------------
From: Jonathan McDowell


suppose I have run a modelling tool to get the best deconvolved image
fit simultaneously to ROSAT and Chandra data, and stored the
result in the FITS file. [...]


However, I would say to Thierry that the new file should indeed have a
brand new dataset identifier - you have in this case created a new
dataset. The traceability to the original observations should be done


This is indeed a new case. In general I'm inclined to consider the
result of any analysis (as opposed to plain "reduction") to be
"private" data. One may keep them, but privately. What matters are the
numbers in the published paper.

But there might be cases indeed in which such data could be stored
and made publicly available (forever ?) although not in a mission
archive.

OK, they are "a new dataset" but who names them ? Are we going to
run into things like "official naming authorities", like the awful
"certificates" and "self signed certificates" stuff ? Should we just
delegate it to the journals and/or use the bibcode (somebody said
something like that) ?

There is at least one other different case, databases and catalogues.
E.g. I'm managing the database for the XMM-LSS survey (which is a
survey done *with* XMM by a consortium using some GO time, but not
*by* the XMM ESA project staff, hence "unofficial"). Our collaboration
members (and later the public) can export catalogue subsets as FITS
files. So far I've not worried about "dataset identification".

Of course each RECORD in one of my tables which refer to the XMM data
is associated to an XMM pointing (and its propid-obsid), but I'm not
keeping this info explicit. And there are other tables containing
non X-ray data taken by us (with an optical telescope or with the VLA).
There are tables which are authorized subsets of data taken by other
consortia. There are tables which are pointers to NED or SIMBAD.

Should I really worry here about traceability ? Or just say that the
dataset is the XMM-LSS project (an ORIGIN keyword would be enough !) ?

----------------------------------------------------------------------------
Lucio Chiappetti - IASF/CNR - via Bassini 15 - I-20133 Milano (Italy)
----------------------------------------------------------------------------
L'Italia ripudia la guerra [...] come mezzo di risoluzione delle
controversie internazionali
Italy repudiates war [...] as a way of resolution of international
controversies
[Art. 11 Constitution of the Italian Republic]
----------------------------------------------------------------------------
For more info :
http://www.mi.iasf.cnr.it/~lucio/personal.html
----------------------------------------------------------------------------


  #2  
Old March 23rd 04, 03:04 PM
Arnold Rots
external usenet poster
 
Posts: n/a
Default [fitsbits] 'Dataset Identifications' postings (digest)

It may be good to clarify the context and scope of what Tom is
proposing (at least my take on it; I won't claim to speak for Tom).

The proposal is to introduce the DS_IDENT keyword as a convention for
dataset identifiers and to define one particular set of values for
this keyword - the ones under the authority of the ADS, i.e.,
identifier values starting with "ADS/". Anybody who wants to
participate in the use of this convention is free to do so, but will
have to comply with the rules of that convention, which are:

1. the identifier is of the form "ADS/observatory#dataset"
2. observatory must be taken from the list maintained by the ADS
3. dataset values are controlled by the data center or observatory
that bears responsibility for the observatory archive
4. that controlling authority, and its successors and assigns, must
guarantee access to dataset in perpetuity
5. the keeper of the observatory data will provide a specific set of
services that allow identifier verification, harvesting, and access to
the datasets

If someone else wants to define another class of identifiers (i.e.,
other than the "ADS/" class), that is fine, but it would probably be
sensible to make sure that the values and usage comply with IVOA
standards (as the ADS ones do) in order to maximize usefulness and
recognition.

I can tell what they, most likely, will look like for Chandra.
There will be (at least) three groups:
ADS/Sa.CXO#obs/ObsId
Points to a particular observation
ADS/Sa.CXO#defset/name
Points to a specifically defined set of observations
ADS/Sa.CXO#bibcode/bibcode
Points to all information we have for a particular paper
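
A minimal sketch of how an identifier of the form "ADS/observatory#dataset"
could be split into its parts. The class and function names are hypothetical,
and the only rules applied are the ones visible in the examples: a "/"
separates the authority from the observatory, and a "#" separates the
observatory from the dataset part. It does not check the observatory against
the ADS list.

```python
from typing import NamedTuple


class DatasetId(NamedTuple):
    authority: str    # e.g. "ADS"
    observatory: str  # e.g. "Sa.CXO"
    dataset: str      # observatory-specific part after '#'


def parse_dataset_id(value: str) -> DatasetId:
    """Split 'authority/observatory#dataset' into its three parts."""
    head, hash_sep, dataset = value.partition("#")
    if not hash_sep or not dataset:
        raise ValueError(f"missing '#dataset' part in {value!r}")
    authority, slash, observatory = head.partition("/")
    if not slash or not authority or not observatory:
        raise ValueError(f"missing 'authority/observatory' part in {value!r}")
    return DatasetId(authority, observatory, dataset)


if __name__ == "__main__":
    # "ObsId" is the placeholder used in the text, not a real observation.
    print(parse_dataset_id("ADS/Sa.CXO#obs/ObsId"))
```

The dataset part is kept as an opaque string, since rule 3 leaves its
content to the data center responsible for the archive.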

Of course, this begs two questions:
- Can two files have the same DS_IDENT value?
The answer should be yes, since a dataset may consist of more than one
file.
- Can one file belong to more than one dataset?
The answer is again yes. This may mean that we should allow for
DS_IDn keywords.

(I said "files"; you may read "extensions", if you like)

The question has come up in which headers the keyword should appear.
I would recommend putting it in any and all headers where it is
appropriate - primary and secondary.
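
A rough sketch of what the two "yes" answers imply for software: several
files may share one identifier, and one header may carry several. Headers
are modelled here as plain dictionaries rather than read with a FITS
library, DS_IDn is only the possibility floated above (not an agreed
keyword), and the function names are hypothetical.

```python
import re

# DS_IDENT, or DS_ID followed by an index 1..999 (stays within 8 characters).
_DS_KEY = re.compile(r"DS_IDENT|DS_ID[1-9][0-9]{0,2}")


def dataset_ids(header: dict[str, str]) -> list[str]:
    """Return every dataset identifier found in one header."""
    return [value for key, value in header.items() if _DS_KEY.fullmatch(key)]


def files_by_dataset(headers: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Map each identifier to the files (or extensions) that carry it."""
    index: dict[str, list[str]] = {}
    for name, header in headers.items():
        for ident in dataset_ids(header):
            index.setdefault(ident, []).append(name)
    return index
```

The keys of `headers` can be file names or extension names, which matches
the remark that "files" may be read as "extensions".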

Hope this helps,

- Arnold

Don Wells wrote:
...

From: Thomas McGlynn
Subject: [fitsbits] Dataset identifications.
Newsgroups: sci.astro.fits
Date: Wed, 10 Mar 2004 14:20:18 -0500
Organization: NASA Goddard Space Flight Center
Reply-To:

There is an effort underway at several of the NASA archives
to provide a standard dataset identifier for data that
can be retrieved from the archives. The initial motivation
is that when authors publish a paper they will be able
to specify the data that was used in analysis and systems
like the ADS will be able to provide links to these data
in a systematic way from the papers (and vice versa for
the archives). Currently this is done for a few datasets
but it's a very manual and labor intensive process. Although
the initial impetus is coming from some of the NASA sites,
we've been talking with the VO efforts and hope that the
ID will be of general utility. I've no doubt that if ID's
become established they will be used in many
different ways.

There are discussions still ongoing as to the exact format
to be used. It is intended that the overall format will be
compatible with the identification standards that are being
discussed in the Virtual Observatory world. An example ID
might be ADS/Sa.ROSAT#X/rh701576n00 where the ADS indicates
that the ADS will provide the high level resolution service,
the 'Sa.ROSAT' is an observatory identifier, and the
element that follows the # is observatory specific, but
should be familiar enough for those who have used ROSAT
data.

The question for this group is not so much a discussion of the format
of the ID. Rather it was pointed out that if these IDs are successful
it would be useful to be able to have a standard
FITS keyword that would indicate the dataset id that the current
file belongs to. The keyword 'DS_IDENT' has been suggested.
Does anyone have objections to this or do they know of systems
that already use this keyword? Googling DS_IDENT returns an album
of Donna Summer's but no FITS references.

Also, are there any issues that we need to resolve regarding
the usage of the keyword? One that comes to mind is whether use of this
keyword should be recommended only for the primary header of a FITS
file. If not then a file may not be associated with a unique dataset
id.

I'd appreciate any comments, questions or thoughts on the subject.

Thanks,
Tom McGlynn
HEASARC
...

--
Donald C. Wells Scientist

http://www.cv.nrao.edu/~dwells
National Radio Astronomy Observatory +1-434-296-0277
520 Edgemont Road, Charlottesville, Virginia 22903-2475 USA
_______________________________________________
fitsbits mailing list

http://listmgr.cv.nrao.edu/mailman/listinfo/fitsbits

--------------------------------------------------------------------------
Arnold H. Rots Chandra X-ray Science Center
Smithsonian Astrophysical Observatory tel: +1 617 496 7701
60 Garden Street, MS 67 fax: +1 617 495 7356
Cambridge, MA 02138
USA
http://hea-www.harvard.edu/~arots/
--------------------------------------------------------------------------

  #3  
Old March 23rd 04, 04:51 PM
LC's No-Spam Newsreading account
external usenet poster
 
Posts: n/a
Default [fitsbits] 'Dataset Identifications' postings (digest)

On Tue, 23 Mar 2004, Arnold Rots wrote:

The proposal is to introduce the DS_IDENT keyword as a convention for
dataset identifiers and to define one particular set of values for
this keyword - the ones under the authority of the ADS, i.e.,


What makes them "under the authority of the ADS" ? A specific
agreement between ADS and Observatory archive and/or paper author
and/or journal and/or IAU ?

2. observatory must be taken from the list maintained by the ADS


5. the keeper of the observatory data will provide a specific set of
services that allow identifier verification, harvesting, and access to
the datasets


What is an observatory here ? A ground based institution (but in that
case won't it be better to have a telescope-instrument identifier ?) OR
a satellite OR the OFFICIAL data centre of such satellite data ?

This seems to rule out "private" datasets (as I defined in my earlier
posting) - which might be good - but what about "catalogue" datasets ?


If someone else wants to define another class of identifiers (i.e.,
other than the "ADS/" class), that is fine, but it would probably be
sensible to make sure that the values and usage comply with IVOA
standards (as the ADS ones do) in order to maximize usefulness and
recognition.


what is IVOA ?

is this a task for the FITS community (if not maybe we should stop
here, or confine the discussion to few FITS specific items), for some
other IAU body, or for somebody else ?

ADS/Sa.CXO#obs/ObsId
Points to a particular observation
ADS/Sa.CXO#defset/name
Points to a specifically defined set of observations


Once again these seem to point to something which can be assigned
only by an official data centre.

ADS/Sa.CXO#bibcode/bibcode
Points to all information we have for a particular paper


Who is "we" in the above sentence, and what papers should be concerned ?

Any paper published on a journal indexed by the ADS ?
and who is storing the relevant data ? ADS, CDS, data centre, author ?
Any paper on Chandra assuming that the author sends associated
reduced data to the Chandra data centre
Any paper published by Chandra data centre staff only ?

Of course, this begs two questions:
- Can two files have the same DS_IDENT value?
- Can one file belong to more than one dataset?


Yes, but what about the case of the results of a paper regarding
the analysis of some particular observational data ?

The original (starting) data will be stored at some data centre,
but the result will in general be privately owned by the authors,
and do not BELONG TO the original dataset, more they STEM OUT OF
the original dataset (parent-child relation)

--
----------------------------------------------------------------------
is a newsreading account used by more persons to
avoid unwanted spam. Any mail returning to this address will be rejected.
Users can disclose their e-mail address in the article if they wish so.

  #4  
Old March 23rd 04, 08:14 PM
Arnold Rots
external usenet poster
 
Posts: n/a
Default [fitsbits] 'Dataset Identifications' postings (digest)

Maybe it helps to state the practical purpose of the identifiers.
It's put in there to inform users as to what dataset identifier to use
if and when they insert such identifiers into their manuscripts.

The purpose of that is to facilitate the linkage between the
literature and the archived datasets. Those links are currently being
maintained by a number of data centers (and the ADS) but it is rather
labor-intensive. This mechanism would allow for automatic harvesting.

More responses below.

- Arnold

LC's No-Spam Newsreading account wrote:
On Tue, 23 Mar 2004, Arnold Rots wrote:

The proposal is to introduce the DS_IDENT keyword as a convention for
dataset identifiers and to define one particular set of values for
this keyword - the ones under the authority of the ADS, i.e.,


What makes them "under the authority of the ADS" ? A specific
agreement between ADS and Observatory archive and/or paper author
and/or journal and/or IAU ?


The fact that they start with "ADS/". It is indeed tied in with an
agreement between ADS, data centers, journals, aimed at enabling ADS
and data centers to harvest literature-dataset links.


2. observatory must be taken from the list maintained by the ADS


5. the keeper of the observatory data will provide a specific set of
services that allow identifier verification, harvesting, and access to
the datasets


What is an observatory here ? A ground based institution (but in that
case won't it be better to have a telescope-instrument identifier ?) OR
a satellite OR the OFFICIAL data centre of such satellite data ?


You will find the current list at:
http://vo.ads.harvard.edu/dv/facilities.txt


This seems to rule out "private" datasets (as I defined in my earlier
posting) - which might be good - but what about "catalogue" datasets ?


At least under this authority ID (ADS).



If someone else wants to define another class of identifiers (i.e.,
other than the "ADS/" class), that is fine, but it would probably be
sensible to make sure that the values and usage comply with IVOA
standards (as the ADS ones do) in order to maximize usefulness and
recognition.


what is IVOA ?


International Virtual Observatory Alliance


is this a task for the FITS community (if not maybe we should stop
here, or confine the discussion to few FITS specific items), for some
other IAU body, or for somebody else ?


No, not really, but it deals with a convention involving a FITS
keyword which may have repercussion for future use of this keyword.


ADS/Sa.CXO#obs/ObsId
Points to a particular observation
ADS/Sa.CXO#defset/name
Points to a specifically defined set of observations


Once again these seem to point to something which can be assigned
only by an official data centre.


Yes.


ADS/Sa.CXO#bibcode/bibcode
Points to all information we have for a particular paper


Who is "we" in the above sentence, and what papers should be concerned ?


CDA


Any paper published on a journal indexed by the ADS ?


No, the ones for which we know there is a Chandra link (in this example).

and who is storing the relevant data ? ADS, CDS, data centre, author ?


ADS and us.

Any paper on Chandra assuming that the author sends associated
reduced data to the Chandra data centre


Yes, any paper on Chandra data, but no, not linked to products
produced by the author - only the archived datasets produced by CXC
(where the author started from, presumably).

Any paper published by Chandra data centre staff only ?

Of course, this begs two questions:
- Can two files have the same DS_IDENT value?
- Can one file belong to more than one dataset?


Yes, but what about the case of the results of a paper regarding
the analysis of some particular observational data ?

The original (starting) data will be stored at some data centre,
but the result will in general be privately owned by the authors,
and do not BELONG TO the original dataset, more they STEM OUT OF
the original dataset (parent-child relation)


That's correct.



--------------------------------------------------------------------------
Arnold H. Rots Chandra X-ray Science Center
Smithsonian Astrophysical Observatory tel: +1 617 496 7701
60 Garden Street, MS 67 fax: +1 617 495 7356
Cambridge, MA 02138
USA
http://hea-www.harvard.edu/~arots/
--------------------------------------------------------------------------

  #5  
Old March 23rd 04, 09:54 PM
Rob Seaman
external usenet poster
 
Posts: n/a
Default [fitsbits] 'Dataset Identifications' postings (digest)

Arnold Rots writes:

Maybe it helps to state the practical purpose of the identifiers.
It's put in there to inform users as to what dataset identifier to use
if and when they insert such identifiers into their manuscripts.


Thanks! Yes, that does help.

The purpose of that is to facilitate the linkage between the
literature and the archived datasets. Those links are currently being
maintained by a number of data centers (and the ADS) but it is rather
labor-intensive. This mechanism would allow for automatic harvesting.


An eminently desirable goal. This causes me to strengthen my
recommendation that the reserved keyword name(s) be ADSID and ADSIDnnn.
(I imagine a thousand ADS dataset identifiers are sufficient for a
particular FITS HDU - are they?)

You will find the current list at:
http://vo.ads.harvard.edu/dv/facilities.txt


A very interesting list. Might I suggest that this list be itself
scrubbed and extended as part of this process? There is a lot of
confusion about the organizations contained on the list. For instance,
here are the overtly NOAO related entries:

KPNO.12m Kitt Peak National Observatory/12 meter Telescope
KPNO.2.1m Kitt Peak National Observatory/2.1 meter Telescope
KPNO.BT Kitt Peak National Observatory/Bok Telescope
KPNO.MAYALL Kitt Peak National Observatory/Mayall Telescope
KPNO.MDMHT Kitt Peak National Observatory/MDM Hitner Telescope
KPNO.MDMMH Kitt Peak National Observatory/MDM HcGraw-Hill Telescope
KPNO.MPT Kitt Peak National Observatory/McMath-Pierce Telescope
KPNO.SARA Kitt Peak National Observatory/Southeastern Association
for Reasearch in Astronomy Telescope
KPNO.SWT Kitt Peak National Observatory/Space Watch Telescope
KPNO.WIYN Kitt Peak National Observatory/WYIN,
Wisconson-Indiana-Yale-NOAO Telescope

CTIO.1.5m Cerro Tololo Inter-American Observatory/1.5 meter Telescope
CTIO.2MASS Cerro Tololo Inter-American Observatory/2MASS Telescope
CTIO.VBT Cerro Tololo Inter-American Observatory/Victor Blanco
Telescope
CTIO.YALO Cerro Tololo Inter-American Observatory/YALO,
Yale-AURA-Lisbon-OU Telescope

First, note that the "National Optical Astronomy Observatory" is not
mentioned yet NOAO is likely the legal owner of many data products
resulting from some of these facilities.

Second, note:

1) that data from KPNO.12m is owned (I would think) by *NRAO* (as is
the telescope),
2) that data from KPNO.BT and KPNO.SWT is owned by the University
of Arizona (or perhaps the state of Arizona),
3) that data from KPNO.MPT is owned by the National Solar Observatory,
4) that data from KPNO.MDMHT and KPNO.MDMMH is owned by whoever owned
MDM during the epoch of the observations in question,
5) that data from KPNO.SARA is owned by the SARA consortium,
6) that data from KPNO.WIYN is owned by the WIYN consortium, one
member of which is NOAO,
7) that there are two 2MASS telescopes and only one is at CTIO
8) that CTIO.YALO was run by the - you guessed it - YALO consortium
and has since ceased operations

It is quite likely that I got some of those nuances wrong myself :-)

There appears to be a confusion between a ground-based observing site
and an observatory - perhaps this is a result of the list being compiled
by our friends in the space-based astronomical community?

In general an observatory is a political entity, a telescope is a facility,
and a site like Kitt Peak is a piece of real estate that may host
multiple facilities from multiple observatories. Depending on the details
of contracts or other binding operating agreements, an observatory may
"own" the data that result from a particular facility like a telescope,
instrument, archive or pipeline - or that ownership may devolve to a
specific member of some consortium. In many cases, one imagines that
a funding agency or government or perhaps even the "people of the United
States of America" may ultimately own a particular data product.

So, an example. NOAO operates twin 8Kx8K mosaic wide field imagers
at its sites on Kitt Peak in Arizona and on Cerro Tololo in Chile.
Depending on the phase of the moon (quite literally :-) the resulting
data may be owned by NOAO or by some instrumentalities associated with
the University of Wisconsin, Indiana University, Yale University and
in the near future perhaps the University of Maryland. Confounded with
this question of ownership is the issue of proprietary rights. Time
on NOAO facilities is awarded competitively and the successful PIs are
rewarded with sole access for some period (typically 18 months).

A dataset ID can be a relatively simple beast - perhaps as simple as
a data source ID and a serial number. But the full taxonomy of dataset
provenance has to support many degrees of freedom. At the very least:

Nation
Funding agency
Observatory
Consortium member ("partner")
Telescope
Instrument
Date&Time
Proposal ID
PI and/or project ID
...
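
As a loose illustration of why these degrees of freedom do not collapse
into one string, the list above can be written as a record in which every
field is optional. The class and field names are hypothetical and are not
proposed keywords.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Provenance:
    """One optional slot per degree of freedom in the list above."""
    nation: Optional[str] = None
    funding_agency: Optional[str] = None
    observatory: Optional[str] = None
    partner: Optional[str] = None
    telescope: Optional[str] = None
    instrument: Optional[str] = None
    date_time: Optional[str] = None
    proposal_id: Optional[str] = None
    pi_or_project: Optional[str] = None

    def known(self) -> dict[str, str]:
        """Return only the fields that have actually been filled in."""
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }
```

A data source ID plus a serial number would populate only a couple of these
slots, which is the gap the following paragraphs are concerned with.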

The more I listen to myself talk, the more I convince (myself, anyway :-)
that a single DS_IDENT keyword is a very poor match to the underlying
requirements. Not only might a single file belong to multiple datasets
certified by a particular entity (like ADS), but they may belong to
multiple other datasets certified by multiple other entities - and more
to the point, the design of the certification process will vary from one
to the next to the next.

In particular, the NOAO Science Archive has been discussing the precise
questions of ownership and proprietary access and had already selected
a subset of fields along the lines of Observatory (NOAO, WIYN, SOAR, etc.),
Partner (NOAO, Wisconsin, Indiana, Yale, Brazil, etc.), Telescope (kp4m,
ct4m, wiyn, soar, etc.), Instrument (too many to list), Date&Time, and
(most similar to the ADS scheme) the NOAO Proposal ID spanning all these
facilities. Whatever we settle on will never fit within the confines of
any single keyword. On the other hand, I'd love to *also* include an
ADSID tag to even further constrain the provenance.

Rob Seaman
NOAO Science Data Systems
  #6  
Old March 24th 04, 03:11 PM
Thomas McGlynn
external usenet poster
 
Posts: n/a
Default [fitsbits] 'Dataset Identifications' postings (digest)

I think there are two different things that are getting confused
in this E-mail discussion. They are closely related, but I think
one is possible to address here, while the other requires a much
broader venue than this list can provide.

When I initiated this discussion I was asking if it would make
sense to reserve a keyword in FITS that would be used to specify
an identification of the datasets to which the file or HDU belonged.
While there is currently a specific format for that identification
being considered by some of us, I don't believe it is necessary
to tie the question of whether we define such a keyword
to any specific syntax used. E.g., in FITS today we have keywords
ORIGIN, TELESCOP, INSTRUME and OBSERVER where the general semantics of
the keyword is specified, but the format is completely undefined
(other than that it is a string). It is at that level that I believe
we could agree on using DS_IDENT (or any other value or values).

So I see the discussion about where such a keyword would go,
whether we need a keyword that allows for multiple values
(which DS_IDENT would not) as the kind of things we could
hope to hash out in a discussion here. My own read on this
part of the discussion is that most people would want to see the
ID repeated in all relevant HDU's and that there probably needs
to be at least an option for the id to be a vector value. The
latter requirement mandates a shorter keyword (perhaps just DSID).
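
The reason a vector value forces a shorter root is that FITS keyword names
are limited to 8 characters, and DS_IDENT already uses all 8. A small
sketch of that arithmetic, with a hypothetical helper name:

```python
MAX_KEYWORD_LEN = 8  # FITS keyword names are at most 8 characters long


def indexed_keywords(root: str, count: int) -> list[str]:
    """Return root1 .. rootN, refusing names too long for a FITS keyword."""
    names = [f"{root}{i}" for i in range(1, count + 1)]
    too_long = [n for n in names if len(n) > MAX_KEYWORD_LEN]
    if too_long:
        raise ValueError(
            f"{too_long[0]!r} exceeds {MAX_KEYWORD_LEN} characters"
        )
    return names


if __name__ == "__main__":
    print(indexed_keywords("DSID", 3))
    # indexed_keywords("DS_IDENT", 1) raises ValueError: no room for an index.
```

A 4-character root such as DSID leaves room for a 4-digit index, while a
5-character root such as ADSID leaves room for 3 digits.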


However, I do not think that this is the appropriate forum
for discussion of a particular syntax for the value of this keyword.
I just don't think we can muster the kind of representation from
the scientific community that would be needed. While the ADEC hopes
that our IDs will be useful and that others will adopt them, we
have no power to force such a change -- though the astronomy
journals may have a bit broader influence. So if, for example,
NOAO were to adopt a different syntax and style for the dataset IDs, for
good and sufficient reasons of their own, then they could use the same
keyword or keywords and go ahead on their own. It would be desirable
in this case if it was possible to distinguish the different syntaxes
used. Regardless I think it would be better to have a standard place to look
for the IDs than for software to have to look for a list of
keywords and see if there was ADSID or ADECID or NOAOID or NRAOID or CDSID
or .... The standard keyword[s] would say where to look and with only
a minimal level of collaboration we could make sure our different syntaxes
didn't interfere with one another. If a new institution decided to create
some new id schema they would know where to put it, and I think the chance
that existing software could find and use that ID would be much greater.
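
A minimal sketch of the "standard place, different syntaxes" idea: software
reads the values from the one agreed keyword and tells the syntaxes apart by
their leading namespace. It assumes every scheme starts with a namespace
followed by "/", as the "ADS/" values do; the function names are
hypothetical.

```python
def namespace_of(value: str) -> str:
    """Return the text before the first '/', taken as the naming authority."""
    namespace, slash, _rest = value.partition("/")
    if not slash or not namespace:
        raise ValueError(f"no leading namespace in {value!r}")
    return namespace


def group_by_namespace(values: list[str]) -> dict[str, list[str]]:
    """Group identifier values by authority, leaving the rest uninterpreted."""
    groups: dict[str, list[str]] = {}
    for value in values:
        groups.setdefault(namespace_of(value), []).append(value)
    return groups
```

Software that only understands one namespace can then pick out its own
group and ignore the others, without scanning for a growing list of
institution-specific keywords.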

That said I'm not really disagreeing with Bob that discussion of the syntax
of the IDs is necessary. All I'm saying is that I don't think we can
come to a conclusion to that discussion here.

It's easy enough to continue though, and I've added a couple
of more specific comments below. (:

Tom

Rob Seaman wrote:

recommendation that the reserved keyword name(s) be ADSID and ADSIDnnn.
(I imagine a thousand ADS dataset identifiers are sufficient for a
particular FITS HDU - are they?)


The basic idea of the IDs as they have been conceived of by the ADEC
is that they allow the establishment of individual namespaces. So if, for
example, NOAO doesn't like the naming scheme that is used, it would
be straightforward to create a set of noao/... ids that conformed
to what would be appropriate for your datasets.
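
A minimal sketch of how software might treat such namespaced IDs, assuming only that the namespace is whatever precedes the first "/" (as in the noao/... form above). The local parts used below are invented placeholders, and the function names are hypothetical.

```python
def split_namespace(dataset_id):
    """Split 'namespace/local' on the first '/'; raise ValueError if malformed."""
    namespace, sep, local = dataset_id.partition("/")
    if not sep or not namespace or not local:
        raise ValueError(f"no namespace in dataset id {dataset_id!r}")
    return namespace, local


def group_by_namespace(dataset_ids):
    """Group local identifiers under their namespace, preserving input order."""
    groups = {}
    for dataset_id in dataset_ids:
        namespace, local = split_namespace(dataset_id)
        groups.setdefault(namespace, []).append(local)
    return groups


# Invented example values; the real local syntax is up to each namespace owner.
ids = ["noao/example-0001", "other/example-A", "noao/example-0002"]
print(group_by_namespace(ids))
```

The point of the design is that a reader never needs to understand the local part; it only needs the namespace to know whose rules apply.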

A very interesting list. Might I suggest that this list be itself
scrubbed and extended as part of this process? There is a lot of
confusion about the organizations contained on the list. For instance,
here are the overtly NOAO related entries:

KPNO.12m     Kitt Peak National Observatory/12 meter Telescope
KPNO.2.1m    Kitt Peak National Observatory/2.1 meter Telescope
KPNO.BT      Kitt Peak National Observatory/Bok Telescope
KPNO.MAYALL  Kitt Peak National Observatory/Mayall Telescope
KPNO.MDMHT   Kitt Peak National Observatory/MDM Hiltner Telescope
KPNO.MDMMH   Kitt Peak National Observatory/MDM McGraw-Hill Telescope
KPNO.MPT     Kitt Peak National Observatory/McMath-Pierce Telescope
KPNO.SARA    Kitt Peak National Observatory/Southeastern Association
             for Research in Astronomy Telescope
KPNO.SWT     Kitt Peak National Observatory/Space Watch Telescope
KPNO.WIYN    Kitt Peak National Observatory/WIYN,
             Wisconsin-Indiana-Yale-NOAO Telescope

CTIO.1.5m    Cerro Tololo Inter-American Observatory/1.5 meter Telescope
CTIO.2MASS   Cerro Tololo Inter-American Observatory/2MASS Telescope
CTIO.VBT     Cerro Tololo Inter-American Observatory/Victor Blanco
             Telescope
CTIO.YALO    Cerro Tololo Inter-American Observatory/YALO,
             Yale-AURA-Lisbon-OU Telescope


The syntax that was suggested was observatoryLocation.telescope,
as the way of identifying datasets that will be most
straightforward for users. This list was suggested by
someone at ApJ as I recall. There has been some discussion
about how and if these should be tied to organizations.
One concern with organizational ties is that these IDs are
intended to be permanent. So 50 years from now it may be irrelevant to
users that a particular telescope was for a time run by a given
organization, and it's certainly possible that control of a telescope
(and its data) will shift from one organization
to another over the course of its lifetime. In the NASA world, that's actually
quite normal.


First, note that the "National Optical Astronomy Observatory" is not
mentioned, yet NOAO is likely the legal owner of many data products
resulting from some of these facilities.

Second, note:

1) that data from KPNO.12m is owned (I would think) by *NRAO* (as is
the telescope),
2) that data from KPNO.BT and KPNO.SWT is owned by the University
of Arizona (or perhaps the state of Arizona),
3) that data from KPNO.MPT is owned by the National Solar Observatory,
4) that data from KPNO.MDMHT and KPNO.MDMMH is owned by whoever owned
MDM during the epoch of the observations in question,
5) that data from KPNO.SARA is owned by the SARA consortium,
6) that data from KPNO.WIYN is owned by the WIYN consortium, one
member of which is NOAO,
7) that there are two 2MASS telescopes and only one is at CTIO
8) that CTIO.YALO was run by the - you guessed it - YALO consortium
and has since ceased operations

Right, our thought is that organizations will register as
responsible for particular dataset holdings. So, e.g., the YALO consortium would
have registered as responsible for that holding, and when it ceased
operations whoever inherited responsibility for the holding (if anyone)
could register as the responsible party. Thus the granularity of the
dataset holdings needs to be small enough that a single party is likely to be
responsible for each.
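
The registration idea can be sketched as a small lookup table in which the holding name stays fixed while the responsible party changes over time. This is only an illustration under stated assumptions: the class, its methods and the successor name are hypothetical, and no such registry interface is described in this thread.

```python
class HoldingRegistry:
    """Toy registry: one responsible party per holding, with history kept."""

    def __init__(self):
        self._history = {}  # holding name -> list of parties, oldest first

    def register(self, holding, party):
        """Record `party` as the current responsible party for `holding`."""
        self._history.setdefault(holding, []).append(party)

    def responsible_party(self, holding):
        """Return the most recently registered party, or None if unregistered."""
        parties = self._history.get(holding)
        return parties[-1] if parties else None

    def history(self, holding):
        """Return a copy of every party ever registered for `holding`."""
        return list(self._history.get(holding, []))


registry = HoldingRegistry()
registry.register("CTIO.YALO", "YALO consortium")
# Hypothetical successor, invented purely for the example.
registry.register("CTIO.YALO", "successor archive")
print(registry.responsible_party("CTIO.YALO"))
print(registry.history("CTIO.YALO"))
```

Because the ID refers to the holding rather than to the organization, a change of responsibility updates the registry without touching any ID already written into a file or a paper.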

It is quite likely that I got some of those nuances wrong myself :-)

There appears to be a confusion between a ground-based observing site
and an observatory - perhaps this is a result of the list being compiled
by our friends in the space-based astronomical community?

No... As I mentioned above we didn't do this. If we had we surely wouldn't
have lumped all space observatories together! It may be that rather
than KPNO and CTIO they should be KP and CT. That certainly seems
reasonable to me. I don't think this list is set in concrete
or even particularly old jello.

In general an observatory is a political entity, a telescope is a facility,
and a site like Kitt Peak is a piece of real estate that may host
multiple facilities from multiple observatories. Depending on the details
of contracts or other binding operating agreements, an observatory may
"own" the data that result from a particular facility like a telescope,
instrument, archive or pipeline - or that ownership may devolve to a
specific member of some consortium. In many cases, one imagines that
a funding agency or government or perhaps even the "people of the United
States of America" may ultimately own a particular data product.

So, an example. NOAO operates twin 8Kx8K mosaic wide field imagers
at its sites on Kitt Peak in Arizona and on Cerro Tololo in Chile.
Depending on the phase of the moon (quite literally :-) the resulting
data may be owned by NOAO or by some instrumentalities associated with
the University of Wisconsin, Indiana University, Yale University and
in the near future perhaps the University of Maryland. Confounded with
this question of ownership is the issue of proprietary rights. Time
on NOAO facilities is awarded competitively and the successful PIs are
rewarded with sole access for some period (typically 18 months).


All of these issues are certainly complex, but in some sense they
are irrelevant. Either the organizations can work out some
agreements about how data are named that can be put into
a dataset id, or they can't and it won't happen. I don't
think we need to solve every problem to have a useful
capability.

A dataset ID can be a relatively simple beast - perhaps as simple as
a data source ID and a serial number. But the full taxonomy of dataset
provenance has to support many degrees of freedom. At the very least:

Nation
Funding agency
Observatory
Consortium member ("partner")
Telescope
Instrument
Date&Time
Proposal ID
PI and/or project ID
...


Here I think you are confusing the metadata describing an observation
with the 'name' of an observation. Why should an ID have the time?
One might choose to use the time in the ID, but there is no reason
why it has to be done that way. Why does it need a proposal ID, nation, agency?
Again, you can choose to put them there, but I see no requirement that the
general ID specification include them. We are not trying
to use the ID as a way of encapsulating the description of the
dataset, just a way to point to it.

The more I listen to myself talk, the more I convince (myself, anyway :-)
that a single DS_IDENT keyword is a very poor match to the underlying
requirements. Not only might a single file belong to multiple datasets
certified by a particular entity (like ADS), but it may belong to
multiple other datasets certified by multiple other entities - and more
to the point, the design of the certification process will vary from one
to the next to the next.

In particular, the NOAO Science Archive has been discussing the precise
questions of ownership and proprietary access and had already selected
a subset of fields along the lines of Observatory (NOAO, WIYN, SOAR, etc.),
Partner (NOAO, Wisconsin, Indiana, Yale, Brazil, etc.), Telescope (kp4m,
ct4m, wiyn, soar, etc.), Instrument (too many to list), Date&Time, and
(most similar to the ADS scheme) the NOAO Proposal ID spanning all these
facilities. Whatever we settle on will never fit within the confines of
any single keyword. On the other hand, I'd love to *also* include an
ADSID tag to even further constrain the provenance.


Agreeing on metadata fields is great, but I think it's
largely orthogonal to the question of whether we want a dataset id somewhere
as indeed your last comment suggests.
  #9  
Old March 24th 04, 03:11 PM
Thomas McGlynn
external usenet poster
 
Posts: n/a
Default [fitsbits] 'Dataset Identifications' postings (digest)

I think there are two different things that are getting confused
in this E-mail discussion. They are closely related, but I think
one is possible to address here, while the other requires a much
broader venue than this list can provide.

When I initiated this discussion I was asking if it would make
sense to reserve a keyword in FITS that would be used to specify
an identification of the datasets to which the file or HDU belonged.
While there is currently a specific format for that identification
being considered by some of us, I don't believe it is necessary
to tie the question of whether we define such a keyword
to any specific syntax used. E.g., in FITS today we have keywords
ORIGIN, TELESCOP, INSTRUME and OBSERVER where the general semantics of
the keyword is specified, but the format is completely undefined
(other than that it is a string). It is at that level that I believe
we could agree on using DS_IDENT (or any other value or values).

So I see the discussion about where such a keyword would go,
and whether we need a keyword that allows for multiple values
(which DS_IDENT would not), as the kind of thing we could
hope to hash out in a discussion here. My own read on this
part of the discussion is that most people would want to see the
ID repeated in all relevant HDUs and that there probably needs
to be at least an option for the id to be a vector value. The
latter requirement mandates a shorter keyword (perhaps just DSID).


However, I do not think that this is the appropriate forum
for discussion of a particular syntax for the value of this keyword.
I just don't think we can muster the kind of representation from
the scientific community that would be needed. While the ADEC hopes
that our IDs will be useful and that others will adopt them, we
have no power to force such a change -- though the astronomy
journals may have a bit broader influence. So if, for example,
NOAO were to adopt a different syntax and style for the dataset IDs, for
good and sufficient reasons of their own, then they could use the same
keyword or keywords and go ahead on their own. It would be desirable
in this case if it were possible to distinguish the different syntaxes
used. Regardless, I think it would be better to have a standard place to look
for the IDs than for software to have to look for a list of
keywords and see if there was ADSID or ADECID or NOAOID or NRAOID or CDSID
or .... The standard keyword[s] would say where to look and with only
a minimal level of collaboration we could make sure our different syntaxes
didn't interfere with one another. If a new institution decided to create
some new id schema they would know where to put it, and I think the chance
that existing software could find and use that ID would be much greater.
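
As a rough illustration of the "standard place to look" argument, here is a minimal Python sketch. It assumes a hypothetical convention in which a header carries DSID and/or DSID1, DSID2, ...; neither that keyword family nor the ID strings shown are an agreed standard, and the header is modelled as a plain dict rather than read with any FITS library.

```python
import re

# Hypothetical convention: DSID holds one ID; DSID1, DSID2, ... hold more.
_DSID_PATTERN = re.compile(r"^DSID(\d*)$")


def collect_dataset_ids(header: dict) -> list:
    """Return every dataset ID found in the header, ordered by keyword index."""
    found = []
    for keyword, value in header.items():
        match = _DSID_PATTERN.match(keyword)
        if match:
            # A bare DSID (no digits) sorts ahead of the indexed keywords.
            index = int(match.group(1)) if match.group(1) else 0
            found.append((index, value))
    return [value for _, value in sorted(found, key=lambda pair: pair[0])]


# Invented header and ID strings, for illustration only.
header = {
    "TELESCOP": "example telescope",
    "DSID2": "noao/example-0002",
    "DSID1": "ads/example-0001",
}
print(collect_dataset_ids(header))
```

With one keyword family, software needs no per-archive list (ADSID, ADECID, NOAOID, ...) to find the IDs; telling the syntaxes apart is then a matter of the value, for instance a namespace prefix.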

That said I'm not really disagreeing with Bob that discussion of the syntax
of the IDs is necessary. All I'm saying is that I don't think we can
come to a conclusion to that discussion here.

It's easy enough to continue though, and I've added a couple
of more specific comments below. (:

Tom

Rob Seaman wrote:

recommendation that the reserved keyword name(s) be ADSID and ADSIDnnn.
(I imagine a thousand ADS dataset identifiers are sufficient for a
particular FITS HDU - are they?)


The basic idea of the IDs as they have been conceived of by the ADEC
is that they allow establishment of individual namespaces. So, if for
example NOAO doesn't like the naming scheme that is used, it would
be straightforward to create a set of noao/... ids that conformed
to what would be appropriate for your datasets.

A very interesting list. Might I suggest that this list be itself
scrubbed and extended as part of this process? There is a lot of
confusion about the organizations contained on the list. For instance,
here are the overtly NOAO related entries:

KPNO.12m Kitt Peak National Observatory/12 meter Telescope
KPNO.2.1m Kitt Peak National Observatory/2.1 meter Telescope
KPNO.BT Kitt Peak National Observatory/Bok Telescope
KPNO.MAYALL Kitt Peak National Observatory/Mayall Telescope
KPNO.MDMHT Kitt Peak National Observatory/MDM Hiltner Telescope
KPNO.MDMMH Kitt Peak National Observatory/MDM McGraw-Hill Telescope
KPNO.MPT Kitt Peak National Observatory/McMath-Pierce Telescope
KPNO.SARA Kitt Peak National Observatory/Southeastern Association
for Research in Astronomy Telescope
KPNO.SWT Kitt Peak National Observatory/Space Watch Telescope
KPNO.WIYN Kitt Peak National Observatory/WIYN,
Wisconsin-Indiana-Yale-NOAO Telescope

CTIO.1.5m Cerro Tololo Inter-American Observatory/1.5 meter Telescope
CTIO.2MASS Cerro Tololo Inter-American Observatory/2MASS Telescope
CTIO.VBT Cerro Tololo Inter-American Observatory/Victor Blanco
Telescope
CTIO.YALO Cerro Tololo Inter-American Observatory/YALO,
Yale-AURA-Lisbon-OU Telescope


The syntax that was suggested was observatoryLocation.telescope
as the way of identifying datasets in a way that will be most
straightforward for users. This list was suggested by
someone at ApJ as I recall. There has been some discussion
about how and if these should be tied to organizations.
One concern with organizational ties is that these IDs are
intended to be permanent. So 50 years from now it may be irrelevant to
users that a particular telescope was for a time run by a given
organization, and it's certainly possible that control of a telescope
(and its data) will shift from one organization
to another over the course of its lifetime. In the NASA world, that's actually
quite normal.
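
For what it's worth, here is a small Python sketch of reading the observatoryLocation.telescope form. The function name is made up for illustration, and the split-at-the-first-dot rule is an assumption rather than anything specified for the list; it matters because an entry such as KPNO.2.1m contains a second dot inside the telescope part.

```python
def split_facility(code: str) -> tuple:
    """Split 'SITE.telescope' at the first dot only."""
    site, sep, telescope = code.partition(".")
    if not sep or not site or not telescope:
        raise ValueError(f"not of the form site.telescope: {code!r}")
    return site, telescope


# Codes taken from the facility list quoted in this thread.
for code in ["KPNO.2.1m", "KPNO.MAYALL", "CTIO.YALO"]:
    print(split_facility(code))
```

Note that nothing in the code carries an organization, which fits the permanence concern above: the site and telescope parts stay the same when the operator changes.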


First, note that the "National Optical Astronomy Observatory" is not
mentioned yet NOAO is likely the legal owner of many data products
resulting from some of these facilities.

Second, note:

1) that data from KPNO.12m is owned (I would think) by *NRAO* (as is
the telescope),
2) that data from KPNO.BT and KPNO.SWT is owned by the University
of Arizona (or perhaps the state of Arizona),
3) that data from KPNO.MPT is owned by the National Solar Observatory,
4) that data from KPNO.MDMHT and KPNO.MDMMH is owned by whoever owned
MDM during the epoch of the observations in question,
5) that data from KPNO.SARA is owned by the SARA consortium,
6) that data from KPNO.WIYN is owned by the WIYN consortium, one
member of which is NOAO,
7) that there are two 2MASS telescopes and only one is at CTIO
8) that CTIO.YALO was run by the - you guessed it - YALO consortium
and has since ceased operations

Right, our thought is that organizations will register as
responsible for particular dataset holdings. So, e.g., the YALO consortium would
have registered as responsible for that holding and when it ceased
operations whoever has inherited responsibility for the holding (if anyone)
could register as the responsible party. Thus the granularity of the
dataset holdings needs to be small enough that a single party is likely to be
responsible for each.

It is quite likely that I got some of those nuances wrong myself :-)

There appears to be a confusion between a ground-based observing site
and an observatory - perhaps this is a result of the list being compiled
by our friends in the space-based astronomical community?

No... As I mentioned above we didn't do this. If we had we surely wouldn't
have lumped all space observatories together! It may be that rather
than KPNO and CTIO they should be KP and CT. That certainly seems
reasonable to me. I don't think this list is set in concrete
or even particularly old jello.

In general an observatory is a political entity, a telescope is a facility,
and a site like Kitt Peak is a piece of real estate that may host
multiple facilities from multiple observatories. Depending on the details
of contracts or other binding operating agreements, an observatory may
"own" the data that result from a particular facility like a telescope,
instrument, archive or pipeline - or that ownership may devolve to a
specific member of some consortium. In many cases, one imagines that
a funding agency or government or perhaps even the "people of the United
States of America" may ultimately own a particular data product.

So, an example. NOAO operates twin 8Kx8K mosaic wide field imagers
at its sites on Kitt Peak in Arizona and on Cerro Tololo in Chile.
Depending on the phase of the moon (quite literally :-) the resulting
data may be owned by NOAO or by some instrumentalities associated with
the University of Wisconsin, Indiana University, Yale University and
in the near future perhaps the University of Maryland. Confounded with
this question of ownership is the issue of proprietary rights. Time
on NOAO facilities is awarded competitively and the successful PIs are
rewarded with sole access for some period (typically 18 months).


All of these issues are certainly complex, but in some sense they
are irrelevant. Either the organizations can work out some
agreements about how data are named that can be put into
a dataset id, or they can't and it won't happen. I don't
think we need to solve every problem to have a useful
capability.

A dataset ID can be a relatively simple beast - perhaps as simple as
a data source ID and a serial number. But the full taxonomy of dataset
provenance has to support many degrees of freedom. At the very least:

Nation
Funding agency
Observatory
Consortium member ("partner")
Telescope
Instrument
Date&Time
Proposal ID
PI and/or project ID
...


Here I think you are confusing the metadata describing an observation
with the 'name' of an observation. Why should an ID have the time?
One might choose to use the time in the ID. But there is no reason
why it has to be done that way. Why does it need a proposal ID, nation, agency?
Again you can choose to put them there, but I see no requirement why the
general ID specification needs to include this. We are not trying
to use the ID as a way of encapsulating the description of the
dataset, just a way to point to it.

The more I listen to myself talk, the more I convince (myself, anyway :-)
that a single DS_IDENT keyword is a very poor match to the underlying
requirements. Not only might a single file belong to multiple datasets
certified by a particular entity (like ADS), but it may belong to
multiple other datasets certified by multiple other entities - and more
to the point, the design of the certification process will vary from one
to the next to the next.

In particular, the NOAO Science Archive has been discussing the precise
questions of ownership and proprietary access and had already selected
a subset of fields along the lines of Observatory (NOAO, WIYN, SOAR, etc.),
Partner (NOAO, Wisconsin, Indiana, Yale, Brazil, etc.), Telescope (kp4m,
ct4m, wiyn, soar, etc.), Instrument (too many to list), Date&Time, and
(most similar to the ADS scheme) the NOAO Proposal ID spanning all these
facilities. Whatever we settle on will never fit within the confines of
any single keyword. On the other hand, I'd love to *also* include an
ADSID tag to even further constrain the provenance.


Agreeing on metadata fields is great, but I think it's
largely orthogonal to the question of whether we want a dataset id somewhere
as indeed your last comment suggests.

  #10  
Old March 23rd 04, 09:54 PM
Rob Seaman

Arnold Rots writes:

Maybe it helps to state the practical purpose of the identifiers.
It's put in there to inform users as to what dataset identifier to use
if and when they insert such identifiers into their manuscripts.


Thanks! Yes, that does help.

The purpose of that is to facilitate the linkage between the
literature and the archived datasets. Those links are currently being
maintained by a number of data centers (and the ADS) but it is rather
labor-intensive. This mechanism would allow for automatic harvesting.


An eminently desirable goal. This causes me to strengthen my
recommendation that the reserved keyword name(s) be ADSID and ADSIDnnn.
(I imagine a thousand ADS dataset identifiers are sufficient for a
particular FITS HDU - are they?)

You will find the current list at:
http://vo.ads.harvard.edu/dv/facilities.txt


A very interesting list. Might I suggest that this list be itself
scrubbed and extended as part of this process? There is a lot of
confusion about the organizations contained on the list. For instance,
here are the overtly NOAO related entries:

KPNO.12m Kitt Peak National Observatory/12 meter Telescope
KPNO.2.1m Kitt Peak National Observatory/2.1 meter Telescope
KPNO.BT Kitt Peak National Observatory/Bok Telescope
KPNO.MAYALL Kitt Peak National Observatory/Mayall Telescope
KPNO.MDMHT Kitt Peak National Observatory/MDM Hiltner Telescope
KPNO.MDMMH Kitt Peak National Observatory/MDM McGraw-Hill Telescope
KPNO.MPT Kitt Peak National Observatory/McMath-Pierce Telescope
KPNO.SARA Kitt Peak National Observatory/Southeastern Association
for Research in Astronomy Telescope
KPNO.SWT Kitt Peak National Observatory/Space Watch Telescope
KPNO.WIYN Kitt Peak National Observatory/WIYN,
Wisconsin-Indiana-Yale-NOAO Telescope

CTIO.1.5m Cerro Tololo Inter-American Observatory/1.5 meter Telescope
CTIO.2MASS Cerro Tololo Inter-American Observatory/2MASS Telescope
CTIO.VBT Cerro Tololo Inter-American Observatory/Victor Blanco
Telescope
CTIO.YALO Cerro Tololo Inter-American Observatory/YALO,
Yale-AURA-Lisbon-OU Telescope

First, note that the "National Optical Astronomy Observatory" is not
mentioned yet NOAO is likely the legal owner of many data products
resulting from some of these facilities.

Second, note:

1) that data from KPNO.12m is owned (I would think) by *NRAO* (as is
the telescope),
2) that data from KPNO.BT and KPNO.SWT is owned by the University
of Arizona (or perhaps the state of Arizona),
3) that data from KPNO.MPT is owned by the National Solar Observatory,
4) that data from KPNO.MDMHT and KPNO.MDMMH is owned by whoever owned
MDM during the epoch of the observations in question,
5) that data from KPNO.SARA is owned by the SARA consortium,
6) that data from KPNO.WIYN is owned by the WIYN consortium, one
member of which is NOAO,
7) that there are two 2MASS telescopes and only one is at CTIO
8) that CTIO.YALO was run by the - you guessed it - YALO consortium
and has since ceased operations

It is quite likely that I got some of those nuances wrong myself :-)

There appears to be a confusion between a ground-based observing site
and an observatory - perhaps this is a result of the list being compiled
by our friends in the space-based astronomical community?

In general an observatory is a political entity, a telescope is a facility,
and a site like Kitt Peak is a piece of real estate that may host
multiple facilities from multiple observatories. Depending on the details
of contracts or other binding operating agreements, an observatory may
"own" the data that result from a particular facility like a telescope,
instrument, archive or pipeline - or that ownership may devolve to a
specific member of some consortium. In many cases, one imagines that
a funding agency or government or perhaps even the "people of the United
States of America" may ultimately own a particular data product.

So, an example. NOAO operates twin 8Kx8K mosaic wide field imagers
at its sites on Kitt Peak in Arizona and on Cerro Tololo in Chile.
Depending on the phase of the moon (quite literally :-) the resulting
data may be owned by NOAO or by some instrumentalities associated with
the University of Wisconsin, Indiana University, Yale University and
in the near future perhaps the University of Maryland. Confounded with
this question of ownership is the issue of proprietary rights. Time
on NOAO facilities is awarded competitively and the successful PIs are
rewarded with sole access for some period (typically 18 months).

A dataset ID can be a relatively simple beast - perhaps as simple as
a data source ID and a serial number. But the full taxonomy of dataset
provenance has to support many degrees of freedom. At the very least:

Nation
Funding agency
Observatory
Consortium member ("partner")
Telescope
Instrument
Date&Time
Proposal ID
PI and/or project ID
...

The more I listen to myself talk, the more I convince (myself, anyway :-)
that a single DS_IDENT keyword is a very poor match to the underlying
requirements. Not only might a single file belong to multiple datasets
certified by a particular entity (like ADS), but it may belong to
multiple other datasets certified by multiple other entities - and more
to the point, the design of the certification process will vary from one
to the next to the next.

In particular, the NOAO Science Archive has been discussing the precise
questions of ownership and proprietary access and had already selected
a subset of fields along the lines of Observatory (NOAO, WIYN, SOAR, etc.),
Partner (NOAO, Wisconsin, Indiana, Yale, Brazil, etc.), Telescope (kp4m,
ct4m, wiyn, soar, etc.), Instrument (too many to list), Date&Time, and
(most similar to the ADS scheme) the NOAO Proposal ID spanning all these
facilities. Whatever we settle on will never fit within the confines of
any single keyword. On the other hand, I'd love to *also* include an
ADSID tag to even further constrain the provenance.

Rob Seaman
NOAO Science Data Systems
 



