[fitsbits] 'Dataset Identifications' postings (digest)

#11 March 23rd 04, 09:54 PM

Arnold Rots writes:

Maybe it helps to state the practical purpose of the identifiers.
It's put in there to inform users as to what dataset identifier to use
if and when they insert such identifiers into their manuscripts.

Thanks! Yes, that does help.

The purpose of that is to facilitate the linkage between the
literature and the archived datasets. Those links are currently being
maintained by a number of data centers (and the ADS) but it is rather
labor-intensive. This mechanism would allow for automatic harvesting.

An eminently desirable goal. This causes me to strengthen my
recommendation that the reserved keyword name(s) be ADSID and ADSIDnnn.
(I imagine a thousand ADS dataset identifiers are sufficient for a
particular FITS HDU - are they?)
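
A minimal sketch of what an indexed keyword family of this kind could look like when written out as header cards. It assumes the usual FITS limits (8-character keyword names, 80-character cards, an index without leading zeros); the function name indexed_id_cards and the sample ID strings are made up for illustration and are not part of any proposal.

```python
def indexed_id_cards(ids, base="ADSID"):
    """Format dataset IDs as 80-character FITS-style header cards.

    Keyword names are limited to 8 characters, so a 5-letter base
    leaves room for a 3-digit index (1..999).
    """
    if len(ids) > 999:
        raise ValueError("more IDs than a 3-digit index can hold")
    cards = []
    for n, value in enumerate(ids, start=1):
        keyword = f"{base}{n}"
        if len(keyword) > 8:
            raise ValueError(f"keyword {keyword!r} exceeds 8 characters")
        # String values are single-quoted; embedded quotes are doubled
        # and the quoted text is padded to at least 8 characters.
        quoted = "'" + value.replace("'", "''").ljust(8) + "'"
        card = f"{keyword:<8}= {quoted}"
        if len(card) > 80:
            raise ValueError(f"value too long for one card: {value!r}")
        cards.append(card.ljust(80))
    return cards


# Hypothetical ID strings, for illustration only.
for card in indexed_id_cards(["example/0001", "example/0002"]):
    print(card.rstrip())
```

Note that a longer base name would leave fewer digits for the index, which is the trade-off between a descriptive keyword and a large number of IDs per HDU.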

You will find the current list at:
http://vo.ads.harvard.edu/dv/facilities.txt

A very interesting list. Might I suggest that this list be itself
scrubbed and extended as part of this process? There is a lot of
confusion about the organizations contained on the list. For instance,
here are the overtly NOAO related entries:

KPNO.12m Kitt Peak National Observatory/12 meter Telescope
KPNO.2.1m Kitt Peak National Observatory/2.1 meter Telescope
KPNO.BT Kitt Peak National Observatory/Bok Telescope
KPNO.MAYALL Kitt Peak National Observatory/Mayall Telescope
KPNO.MDMHT Kitt Peak National Observatory/MDM Hiltner Telescope
KPNO.MDMMH Kitt Peak National Observatory/MDM McGraw-Hill Telescope
KPNO.MPT Kitt Peak National Observatory/McMath-Pierce Telescope
KPNO.SARA Kitt Peak National Observatory/Southeastern Association for Research in Astronomy Telescope
KPNO.SWT Kitt Peak National Observatory/Space Watch Telescope
KPNO.WIYN Kitt Peak National Observatory/WIYN, Wisconsin-Indiana-Yale-NOAO Telescope

CTIO.1.5m Cerro Tololo Inter-American Observatory/1.5 meter Telescope
CTIO.2MASS Cerro Tololo Inter-American Observatory/2MASS Telescope
CTIO.VBT Cerro Tololo Inter-American Observatory/Victor Blanco Telescope
CTIO.YALO Cerro Tololo Inter-American Observatory/YALO, Yale-AURA-Lisbon-OU Telescope

First, note that the "National Optical Astronomy Observatory" is not
mentioned yet NOAO is likely the legal owner of many data products
resulting from some of these facilities.

Second, note:

1) that data from KPNO.12m is owned (I would think) by *NRAO* (as is
the telescope),
2) that data from KPNO.BT and KPNO.SWT is owned by the University
of Arizona (or perhaps the state of Arizona),
3) that data from KPNO.MPT is owned by the National Solar Observatory,
4) that data from KPNO.MDMHT and KPNO.MDMMH is owned by whoever owned
MDM during the epoch of the observations in question,
5) that data from KPNO.SARA is owned by the SARA consortium,
6) that data from KPNO.WIYN is owned by the WIYN consortium, one
member of which is NOAO,
7) that there are two 2MASS telescopes and only one is at CTIO
8) that CTIO.YALO was run by the - you guessed it - YALO consortium
and has since ceased operations

It is quite likely that I got some of those nuances wrong myself :-)

There appears to be a confusion between a ground-based observing site
and an observatory - perhaps this is a result of the list being compiled
by our friends in the space-based astronomical community?

In general an observatory is a political entity, a telescope is a facility,
and a site like Kitt Peak is a piece of real estate that may host
multiple facilities from multiple observatories. Depending on the details
of contracts or other binding operating agreements, an observatory may
"own" the data that result from a particular facility like a telescope,
instrument, archive or pipeline - or that ownership may devolve to a
specific member of some consortium. In many cases, one imagines that
a funding agency or government or perhaps even the "people of the United
States of America" may ultimately own a particular data product.

So, an example. NOAO operates twin 8Kx8K mosaic wide field imagers
at its sites on Kitt Peak in Arizona and on Cerro Tololo in Chile.
Depending on the phase of the moon (quite literally :-) the resulting
data may be owned by NOAO or by some instrumentalities associated with
the University of Wisconsin, Indiana University, Yale University and
in the near future perhaps the University of Maryland. Confounded with
this question of ownership is the issue of proprietary rights. Time
on NOAO facilities is awarded competitively and the successful PIs are
rewarded with sole access for some period (typically 18 months).

A dataset ID can be a relatively simple beast - perhaps as simple as
a data source ID and a serial number. But the full taxonomy of dataset
provenance has to support many degrees of freedom. At the very least:

Nation
Funding agency
Observatory
Consortium member ("partner")
Telescope
Instrument
Date&Time
Proposal ID
PI and/or project ID
...

The more I listen to myself talk, the more I convince (myself, anyway :-)
that a single DS_IDENT keyword is a very poor match to the underlying
requirements. Not only might a single file belong to multiple datasets
certified by a particular entity (like ADS), but it may belong to
multiple other datasets certified by multiple other entities - and more
to the point, the design of the certification process will vary from one
to the next to the next.

In particular, the NOAO Science Archive has been discussing the precise
questions of ownership and proprietary access and has already selected
a subset of fields along the lines of Observatory (NOAO, WIYN, SOAR, etc.),
Partner (NOAO, Wisconsin, Indiana, Yale, Brazil, etc.), Telescope (kp4m,
ct4m, wiyn, soar, etc.), Instrument (too many to list), Date&Time, and
(most similar to the ADS scheme) the NOAO Proposal ID spanning all these
facilities. Whatever we settle on will never fit within the confines of
any single keyword. On the other hand, I'd love to *also* include an
ADSID tag to even further constrain the provenance.

Rob Seaman
NOAO Science Data Systems

#13 March 24th 04, 03:11 PM

I think there are two different things that are getting confused
in this E-mail discussion. They are closely related, but I think
one is possible to address here, while the other requires a much
broader venue than this list can provide.

When I initiated this discussion I was asking if it would make
sense to reserve a keyword in FITS that would be used to specify
an identification of the datasets to which the file or HDU belonged.
While there is currently a specific format for that identification
being considered by some of us, I don't believe it is necessary
to tie the question of whether we define such a keyword
to any specific syntax used. E.g., in FITS today we have keywords
ORIGIN, TELESCOP, INSTRUME and OBSERVER where the general semantics of
the keyword is specified, but the format is completely undefined
(other than that it is a string). It is at that level that I believe
we could agree on using DS_IDENT (or any other value or values).

So I see the discussion about where such a keyword would go, and
whether we need a keyword that allows for multiple values
(which DS_IDENT would not), as the kind of thing we could
hope to hash out in a discussion here. My own read on this
part of the discussion is that most people would want to see the
ID repeated in all relevant HDUs and that there probably needs
to be at least an option for the ID to be a vector value. The
latter requirement mandates a shorter keyword (perhaps just DSID).

However, I do not think that this is the appropriate forum
for discussion of a particular syntax for the value of this keyword.
I just don't think we can muster the kind of representation from
the scientific community that would be needed. While the ADEC hopes
that our IDs will be useful and that others will adopt them, we
have no power to force such a change -- though the astronomy
journals may have a bit broader influence. So if, for example,
NOAO were to adopt a different syntax and style for the dataset IDs, for
good and sufficient reasons of their own, then they could use the same
keyword or keywords and go ahead on their own. It would be desirable
in this case if it were possible to distinguish the different syntaxes
used. Regardless, I think it would be better to have a standard place to look
for the IDs than for software to have to look through a list of
keywords and see if there was ADSID or ADECID or NOAOID or NRAOID or CDSID
or .... The standard keyword[s] would say where to look and with only
a minimal level of collaboration we could make sure our different syntaxes
didn't interfere with one another. If a new institution decided to create
some new id schema they would know where to put it, and I think the chance
that existing software could find and use that ID would be much greater.
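
To make the "standard place to look" idea concrete, here is a rough sketch of a harvester that only has to know one keyword family. It assumes a header is available as a list of (keyword, value) pairs and that the family is DSID plus DSIDn; the function name find_dataset_ids and the sample values are hypothetical.

```python
import re


def find_dataset_ids(header, base="DSID"):
    """Collect values of `base` and `base` + digits, ordered by index.

    `header` is an iterable of (keyword, value) pairs. The un-indexed
    keyword sorts first (treated as index 0).
    """
    pattern = re.compile(rf"{re.escape(base)}(\d*)")
    found = []
    for keyword, value in header:
        match = pattern.fullmatch(keyword)
        if match:
            index = int(match.group(1)) if match.group(1) else 0
            found.append((index, value))
    return [value for _, value in sorted(found)]


# Hypothetical header contents, for illustration only.
header = [
    ("TELESCOP", "kp4m"),
    ("DSID", "a/1"),
    ("DSID2", "b/2"),
    ("DSID1", "c/3"),
]
print(find_dataset_ids(header))
```

Software written this way does not need a growing list of per-institution keyword names; it only needs to tell the ID syntaxes apart once it has the values.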

That said I'm not really disagreeing with Bob that discussion of the syntax
of the IDs is necessary. All I'm saying is that I don't think we can
come to a conclusion to that discussion here.

It's easy enough to continue though, and I've added a couple
of more specific comments below. (:

Tom

Rob Seaman wrote:

recommendation that the reserved keyword name(s) be ADSID and ADSIDnnn.
(I imagine a thousand ADS dataset identifiers are sufficient for a
particular FITS HDU - are they?)

The basic idea of the IDs as they have been conceived of by the ADEC
is that they allow establishment of individual namespaces. So, if for
example NOAO doesn't like the naming scheme that is used, it would
be straightforward to create a set of noao/... ids that conformed
to what would be appropriate for your datasets.
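
A small sketch of how software might separate the namespace from the rest of an ID. It assumes only that the namespace is whatever precedes the first "/" (as in noao/...); the function name split_namespace and the sample ID are made up, and the actual ID syntax is not defined here.

```python
def split_namespace(dataset_id):
    """Split 'namespace/local-part' at the first slash."""
    namespace, sep, local = dataset_id.partition("/")
    if not sep or not namespace or not local:
        raise ValueError(f"no namespace found in {dataset_id!r}")
    return namespace, local


# Hypothetical ID, for illustration only.
print(split_namespace("noao/kp4m-0001"))
```

Everything after the first slash is left untouched, so each namespace owner stays free to choose its own internal syntax.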

A very interesting list. Might I suggest that this list be itself
scrubbed and extended as part of this process? There is a lot of
confusion about the organizations contained on the list. For instance,
here are the overtly NOAO related entries:

KPNO.12m Kitt Peak National Observatory/12 meter Telescope
KPNO.2.1m Kitt Peak National Observatory/2.1 meter Telescope
KPNO.BT Kitt Peak National Observatory/Bok Telescope
KPNO.MAYALL Kitt Peak National Observatory/Mayall Telescope
KPNO.MDMHT Kitt Peak National Observatory/MDM Hiltner Telescope
KPNO.MDMMH Kitt Peak National Observatory/MDM McGraw-Hill Telescope
KPNO.MPT Kitt Peak National Observatory/McMath-Pierce Telescope
KPNO.SARA Kitt Peak National Observatory/Southeastern Association for Research in Astronomy Telescope
KPNO.SWT Kitt Peak National Observatory/Space Watch Telescope
KPNO.WIYN Kitt Peak National Observatory/WIYN, Wisconsin-Indiana-Yale-NOAO Telescope

CTIO.1.5m Cerro Tololo Inter-American Observatory/1.5 meter Telescope
CTIO.2MASS Cerro Tololo Inter-American Observatory/2MASS Telescope
CTIO.VBT Cerro Tololo Inter-American Observatory/Victor Blanco Telescope
CTIO.YALO Cerro Tololo Inter-American Observatory/YALO, Yale-AURA-Lisbon-OU Telescope

The syntax that was suggested was observatoryLocation.telescope
as the way of identifying datasets in a way that will be most
straightforward for users. This list was suggested by
someone at ApJ as I recall. There has been some discussion
about how and if these should be tied to organizations.
One concern with organizational ties is that these IDs are
intended to be permanent. So 50 years from now it may be irrelevant to
users that a particular telescope was for a time run by a given
organization, and it's certainly possible that control of a telescope
(and its data) will shift from one organization
to another over the course of its lifetime. In the NASA world, that's actually
quite normal.
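
As an illustration of the observatoryLocation.telescope form, the sketch below splits one line of the list into its parts. It assumes each entry is "code description" with the code split at its first "." and the description split at its first "/"; the function name parse_facility_line and the dictionary keys are hypothetical.

```python
def parse_facility_line(line):
    """Parse 'SITE.TELESCOPE Observatory name/Telescope name'."""
    code, description = line.split(None, 1)
    # Split at the first dot only: telescope codes such as '2.1m'
    # contain a dot of their own.
    site, telescope = code.split(".", 1)
    observatory, _, telescope_name = description.partition("/")
    return {
        "site": site,
        "telescope": telescope,
        "observatory": observatory.strip(),
        "telescope_name": telescope_name.strip(),
    }


entry = parse_facility_line(
    "KPNO.2.1m Kitt Peak National Observatory/2.1 meter Telescope"
)
print(entry["site"], entry["telescope"])
```

Nothing in the code ties the site part to an owning organization, which matches the concern above about keeping the IDs stable when control of a telescope changes hands.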

First, note that the "National Optical Astronomy Observatory" is not
mentioned yet NOAO is likely the legal owner of many data products
resulting from some of these facilities.

Second, note:

1) that data from KPNO.12m is owned (I would think) by *NRAO* (as is
the telescope),
2) that data from KPNO.BT and KPNO.SWT is owned by the University
of Arizona (or perhaps the state of Arizona),
3) that data from KPNO.MPT is owned by the National Solar Observatory,
4) that data from KPNO.MDMHT and KPNO.MDMMH is owned by whoever owned
MDM during the epoch of the observations in question,
5) that data from KPNO.SARA is owned by the SARA consortium,
6) that data from KPNO.WIYN is owned by the WIYN consortium, one
member of which is NOAO,
7) that there are two 2MASS telescopes and only one is at CTIO
8) that CTIO.YALO was run by the - you guessed it - YALO consortium
and has since ceased operations

Right, our thought is that organizations will register as
responsible for particular dataset holdings. So, e.g., the YALO consortium would
have registered as responsible for that holding and when it ceased
operations whoever has inherited responsibility for the holding (if anyone)
could register as the responsible party. Thus the granularity of the
dataset holdings needs to be small enough that a single party is likely to be
responsible for each.
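
One way to picture such a registration scheme is a registry that keeps the succession of responsible parties per holding while the holding's name stays fixed. This is only a sketch under that assumption; the class HoldingRegistry, its methods, and the successor name are hypothetical and do not describe any existing service.

```python
class HoldingRegistry:
    """Track which party is responsible for each dataset holding."""

    def __init__(self):
        self._history = {}

    def register(self, holding, party):
        # A new registration supersedes the previous one but the
        # earlier parties are kept for the record.
        self._history.setdefault(holding, []).append(party)

    def responsible(self, holding):
        parties = self._history.get(holding)
        return parties[-1] if parties else None

    def history(self, holding):
        return list(self._history.get(holding, []))


registry = HoldingRegistry()
registry.register("CTIO.YALO", "YALO consortium")
registry.register("CTIO.YALO", "successor archive (hypothetical)")
print(registry.responsible("CTIO.YALO"))
```

Each holding maps to exactly one current party, which is why the granularity of the holdings matters.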

It is quite likely that I got some of those nuances wrong myself :-)

There appears to be a confusion between a ground-based observing site
and an observatory - perhaps this is a result of the list being compiled
by our friends in the space-based astronomical community?

No... As I mentioned above we didn't do this. If we had we surely wouldn't
have lumped all space observatories together! It may be that rather
than KPNO and CTIO they should be KP and CT. That certainly seems
reasonable to me. I don't think this list is set in concrete
or even particularly old jello.

In general an observatory is a political entity, a telescope is a facility,
and a site like Kitt Peak is a piece of real estate that may host
multiple facilities from multiple observatories. Depending on the details
of contracts or other binding operating agreements, an observatory may
"own" the data that result from a particular facility like a telescope,
instrument, archive or pipeline - or that ownership may devolve to a
specific member of some consortium. In many cases, one imagines that
a funding agency or government or perhaps even the "people of the United
States of America" may ultimately own a particular data product.

So, an example. NOAO operates twin 8Kx8K mosaic wide field imagers
at its sites on Kitt Peak in Arizona and on Cerro Tololo in Chile.
Depending on the phase of the moon (quite literally :-) the resulting
data may be owned by NOAO or by some instrumentalities associated with
the University of Wisconsin, Indiana University, Yale University and
in the near future perhaps the University of Maryland. Confounded with
this question of ownership is the issue of proprietary rights. Time
on NOAO facilities is awarded competitively and the successful PIs are
rewarded with sole access for some period (typically 18 months).

All of these issues are certainly complex, but in some sense they
are irrelevant. Either the organizations can work out some
agreements about how data are named that can be put into
a dataset id, or they can't and it won't happen. I don't
think we need to solve every problem to have a useful
capability.

A dataset ID can be a relatively simple beast - perhaps as simple as
a data source ID and a serial number. But the full taxonomy of dataset
provenance has to support many degrees of freedom. At the very least:

Nation
Funding agency
Observatory
Consortium member ("partner")
Telescope
Instrument
Date&Time
Proposal ID
PI and/or project ID
...

Here I think you are confusing the metadata describing an observation
with the 'name' of an observation. Why should an ID have the time?
One might choose to use the time in the ID. But there is no reason
why it has to be done that way. Why does it need a proposal ID, nation, agency?
Again you can choose to put them there, but I see no reason why the
general ID specification needs to require this. We are not trying
to use the ID as a way of encapsulating the description of the
dataset, just a way to point to it.
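
The distinction can be sketched as an opaque ID that is only a lookup key, with the descriptive fields held separately. The names ObservationMetadata, CATALOG and describe, and all of the sample values, are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservationMetadata:
    observatory: str
    telescope: str
    instrument: str
    proposal_id: str


# The ID carries no description of its own; it only points at a record.
CATALOG = {
    "example/0001": ObservationMetadata(
        observatory="NOAO",
        telescope="kp4m",
        instrument="mosaic",
        proposal_id="hypothetical-proposal",
    ),
}


def describe(dataset_id):
    """Resolve an ID to the metadata it points to."""
    return CATALOG[dataset_id]


print(describe("example/0001").telescope)
```

An archive is still free to build some of these fields into its IDs, but nothing in the lookup depends on it.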

The more I listen to myself talk, the more I convince (myself, anyway :-)
that a single DS_IDENT keyword is a very poor match to the underlying
requirements. Not only might a single file belong to multiple datasets
certified by a particular entity (like ADS), but it may belong to
multiple other datasets certified by multiple other entities - and more
to the point, the design of the certification process will vary from one
to the next to the next.

In particular, the NOAO Science Archive has been discussing the precise
questions of ownership and proprietary access and has already selected
a subset of fields along the lines of Observatory (NOAO, WIYN, SOAR, etc.),
Partner (NOAO, Wisconsin, Indiana, Yale, Brazil, etc.), Telescope (kp4m,
ct4m, wiyn, soar, etc.), Instrument (too many to list), Date&Time, and
(most similar to the ADS scheme) the NOAO Proposal ID spanning all these
facilities. Whatever we settle on will never fit within the confines of
any single keyword. On the other hand, I'd love to *also* include an
ADSID tag to even further constrain the provenance.

Agreeing on metadata fields is great, but I think it's
largely orthogonal to the question of whether we want a dataset id somewhere
as indeed your last comment suggests.

#14 March 24th 04, 03:11 PM

I think there are two different things that are getting confused
in this E-mail discussion. They are closely related, but I think
one is possible to address here, while the other requires a much
broader venue than this list can provide.

When I initiated this discussion I was asking if we would make
sense to reserve a keyword in FITS that would be used to specify
an identification of the datasets to which the file or HDU belonged.
While there is currently a specific format for that identification
being considered by some of us, I don't believe it is necessary
to tie the the question of whether we define such a keyword
with any specific syntax used. E.g., in FITS today we have keywords
ORIGIN, TELESCOP, INSTRUME and OBSERVER where the general semantics of
the keyword is specified, but the format is completely undefined
(other than that it is a string). It is at that level that I believe
we could agree on using DS_IDENT (or any other value or values).

So I see the discussion about where such a keyword would go,
whether we need a keyword that allows for multiple values
(which DS_IDENT would not) as the kind of things we could
hope to hash out in a discussion here. My own read on this
part of the discussion is that most people would want to see the
ID repeated in all relevant HDU's and that there probably needs
to be at least an option for the id to be a vector value. The
later requirement mandates a shorter keyword (perhaps just DSID).

However, I do not think that this is the appropriate forum
for discussion of a particular syntax for the value of this keyword.
I just don't think we can muster the kind of representation from
the scientific community that would be needed. While the ADEC hopes
that our IDs will be useful and that others will adopt them, we
have no power to force such a change -- though the astronomy
journals may have a bit broader influence. So if, for example,
NOAO were to adopt a different syntax and style for the dataset IDs, for
good and sufficient reasons of their own, then they could use the same
keyword or keywords and go ahead on their own. It would be desirable
in this case if it was possible to distinguish the different syntaxes
used. Regardless I think it would be better to have a standard place to look
for the IDs than for software to have to look for a list of
keywords and see if there was ADSID or ADECID or NOAOID or NRAOID or CDSID
or .... The standard keyword[s] would say where to look and with only
a minimal level of collaboration we could make sure our different syntaxes
didn't interfere with one another. If a new institution decided to create
some new id schema they would know where to put it, and I think the chance
that existing software could find and use that ID would be much greater.

That said I'm not really disagreeing with Bob that discussion of the syntax
of the IDs is necessary. All I'm saying is that I don't think we can
come to a conclusion to that discussion here.

It's easy enough to continue though, and I've added a couple
of more specific comments below. (:

Tom

Rob Seaman wrote:

recommendation that the reserved keyword name(s) be ADSID and ADSIDnnn.
(I imagine a thousand ADS dataset identifiers are sufficient for a
particular FITS HDU - are they?)

The basic idea of the IDs as they have been conceived of by the ADEC
is that it allows establishment of individual namespaces. So, if for
example NOAO doesn't like the naming scheme that used, it would
be straightforward to create a set of noao/... ids that conformed
to what would be appropriate for your datasets.

A very interesting list. Might I suggest that this list be itself
scrubbed and extended as part of this process? There is a lot of
confusion about the organizations contained on the list. For instance,
here are the overtly NOAO related entries:

KPNO.12m Kitt Peak National Observatory/12 meter Telescope
KPNO.2.1m Kitt Peak National Observatory/2.1 meter Telescope
KPNO.BT Kitt Peak National Observatory/Bok Telescope
KPNO.MAYALL Kitt Peak National Observatory/Mayall Telescope
KPNO.MDMHT Kitt Peak National Observatory/MDM Hitner Telescope
KPNO.MDMMH Kitt Peak National Observatory/MDM HcGraw-Hill Telescope
KPNO.MPT Kitt Peak National Observatory/McMath-Pierce Telescope
KPNO.SARA Kitt Peak National Observatory/Southeastern Association
for Reasearch in Astronomy Telescope
KPNO.SWT Kitt Peak National Observatory/Space Watch Telescope
KPNO.WIYN Kitt Peak National Observatory/WYIN,
Wisconson-Indiana-Yale-NOAO Telescope

CTIO.1.5m Cerro Tololo Inter-American Observatory/1.5 meter Telescope
CTIO.2MASS Cerro Tololo Inter-American Observatory/2MASS Telescope
CTIO.VBT Cerro Tololo Inter-American Observatory/Victor Blanco
Telescope
CTIO.YALO Cerro Tololo Inter-American Observatory/YALO,
Yale-AURA-Lisbon-OU Telescope

The syntax that was suggested was observatoryLocation.telescope
as the way of identifying datasets in a way that will be most
straightforware for users. This list was suggested by
someone at ApJ as I recall. There has been some discussion
about how and if these should be tied to organizations.
One concern with organizational ties is that these ID's are
intended to be permanent. So 50 years from it may be irrelevant to
users that a particular telescope was for a time run by a given
organization, and it's certainly possible that control of a telescope
(and its data) will shift from one organization
to another over the course of its lifetime. In the NASA world, that's actually
quite normal.

First, note that the "National Optical Astronomy Observatory" is not
mentioned yet NOAO is likely the legal owner of many data products
resulting from some of these facilities.

Second, note:

1) that data from KPNO.12m is owned (I would think) by *NRAO* (as is
the telescope),
2) that data from KPNO.BT and KPNO.SWT is owned by the University
of Arizona (or perhaps the state of Arizona),
3) that data from KPNO.MPT is owned by the National Solar Observatory,
4) that data from KPNO.MDMHT and KPNO.MDMMH is owned by whoever owned
MDM during the epoch of the observations in question,
5) that data from KPNO.SARA is owned by the SARA consortium,
6) that data from KPNO.WIYN is owned by the WIYN consortium, one
member of which is NOAO,
7) that there are two 2MASS telescopes and only one is at CTIO
8) that CTIO.YALO was run by the - you guessed it - YALO consortium
and has since ceased operations

Right, our thought is that organizations will register as
responsible for particular dataset holdings. So, e.g., the YALO consoritium would
have registered as responsible for that holding and when it ceased
operations whoever has inherited responsibility for the holding (if anyone)
could register as the responsible party. Thus the granularity of the
datasets holdings needs to be small enough that a single party is likely to be
responsible for each.

It is quite likely that I got some of those nuances wrong myself :-)

There appears to be a confusion between a ground-based observing site
and an observatory - perhaps this is a result of the list being compiled
by our friends in the space-based astronomical community?

No... As I mentioned above we didn't do this. If we had we surely wouldn't
have lumped all space observatories together! It may be that rather
than KPNO and CTIO they should be KP and CT. That certainly seems
reasonable to me. I don't think this list is set in concrete
or even particularly old jello.

In general an observatory is a political entity, a telescope is a facility,
and a site like Kitt Peak is a piece of real estate that may be host
multiple facilities from multiple observatories. Depending on the details
of contracts or other binding operating agreements, an observatory may
"own" the data that result from a particular facility like a telescope,
instrument, archive or pipeline - or that ownership may devolve to a
specific member of some consortium. In many cases, one imagines that
a funding agency or government or perhaps even the "people of the United
States of America" may ultimately own a particular data product.

So, an example. NOAO operates twin 8Kx8K mosaic wide field imagers
at its sites on Kitt Peak in Arizona and on Cerro Tololo in Chile.
Depending on the phase of the moon (quite literally :-) the resulting
data may be owned by NOAO or by some instrumentalities associated with
the University of Wisconsin, Indiana University, Yale University and
in the near future perhaps the University of Maryland. Confounded with
this question of ownership is the issue of proprietary rights. Time
on NOAO facilities is awarded competitively and the successful PIs are
rewarded with sole access for some period (typically 18 months).

All of these issues are certainly complex, but in some sense they
are irrelevant. Either the organizations can work out some
agreements about how data are named that can be put into
a dataset id, or they can't and it won't happen. I don't
think we need to solve every problem to have a useful
capability.

A dataset ID can be a relatively simple beast - perhaps as simple as
a data source ID and a serial number. But the full taxonomy of dataset
provenance has to support many degrees of freedom. At the very least:

Nation
Funding agency
Observatory
Consortium member ("partner")
Telescope
Instrument
Date&Time
Proposal ID
PI and/or project ID
...

Here I think you are confusing the metadata describing an observation
with the 'name' of an observation. Why should an ID have the time?
One might choose to use the time in the ID. But there is no reason
why it has to be done that way. Why does it need a proposal ID, nation, agency?
Again you can choose to put them there, but I see no requirement why the
general ID specification needs to include this. We are not trying
to use the ID as a way of encapsulating the description of the
dataset, just a way to point to it.

The more I listen to myself talk, the more I convince (myself, anyway :-)
that a single DS_IDENT keyword is a very poor match to the underlying
requirements. Not only might a single file belong to multiple datasets
certified by a particular entity (like ADS), but they may belong to
multiple other datasets certified by multiple other entities - and more
to the point, the design of the certification process will vary from one
to the next to the next.

In particular, the NOAO Science Archive has been discussing the precise
questions of ownership and proprietary access and had already selected
a subset of fields along the lines of Observatory (NOAO, WIYN, SOAR, etc.),
Partner (NOAO, Wisconsin, Indiana, Yale, Brazil, etc.), Telescope (kp4m,
ct4m, wiyn, soar, etc.), Instrument (too many to list), Date&Time, and
(most similar to the ADS scheme) the NOAO Proposal ID spanning all these
facilities. Whatever we settle on will never fit within the confines of
any single keyword. On the other hand, I'd love to *also* include an
ADSID tag to even further constrain the provenance.

Agreeing on metadata fields is great, but I think it's
largely orthogonal to the question of whether we want a dataset id somewhere
as indeed your last comment suggests.

#15 March 24th 04, 03:11 PM

I think there are two different things that are getting confused
in this E-mail discussion. They are closely related, but I think
one is possible to address here, while the other requires a much
broader venue than this list can provide.

When I initiated this discussion I was asking if it would make
sense to reserve a keyword in FITS that would be used to specify
an identification of the datasets to which the file or HDU belonged.
While there is currently a specific format for that identification
being considered by some of us, I don't believe it is necessary
to tie the question of whether we define such a keyword
with any specific syntax used. E.g., in FITS today we have keywords
ORIGIN, TELESCOP, INSTRUME and OBSERVER where the general semantics of
the keyword is specified, but the format is completely undefined
(other than that it is a string). It is at that level that I believe
we could agree on using DS_IDENT (or any other value or values).

So I see the discussion about where such a keyword would go,
whether we need a keyword that allows for multiple values
(which DS_IDENT would not) as the kind of things we could
hope to hash out in a discussion here. My own read on this
part of the discussion is that most people would want to see the
ID repeated in all relevant HDU's and that there probably needs
to be at least an option for the id to be a vector value. The
latter requirement mandates a shorter keyword (perhaps just DSID).

However, I do not think that this is the appropriate forum
for discussion of a particular syntax for the value of this keyword.
I just don't think we can muster the kind of representation from
the scientific community that would be needed. While the ADEC hopes
that our IDs will be useful and that others will adopt them, we
have no power to force such a change -- though the astronomy
journals may have a bit broader influence. So if, for example,
NOAO were to adopt a different syntax and style for the dataset IDs, for
good and sufficient reasons of their own, then they could use the same
keyword or keywords and go ahead on their own. It would be desirable
in this case if it was possible to distinguish the different syntaxes
used. Regardless I think it would be better to have a standard place to look
for the IDs than for software to have to look for a list of
keywords and see if there was ADSID or ADECID or NOAOID or NRAOID or CDSID
or .... The standard keyword[s] would say where to look and with only
a minimal level of collaboration we could make sure our different syntaxes
didn't interfere with one another. If a new institution decided to create
some new id schema they would know where to put it, and I think the chance
that existing software could find and use that ID would be much greater.

That said I'm not really disagreeing with Bob that discussion of the syntax
of the IDs is necessary. All I'm saying is that I don't think we can
come to a conclusion to that discussion here.

It's easy enough to continue though, and I've added a couple
of more specific comments below. (:

Tom

Rob Seaman wrote:

recommendation that the reserved keyword name(s) be ADSID and ADSIDnnn.
(I imagine a thousand ADS dataset identifiers are sufficient for a
particular FITS HDU - are they?)

The basic idea of the IDs as they have been conceived of by the ADEC
is that it allows establishment of individual namespaces. So, if for
example NOAO doesn't like the naming scheme that is used, it would
be straightforward to create a set of noao/... ids that conformed
to what would be appropriate for your datasets.
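
A minimal sketch of the namespace idea, in Python. The "namespace/local-part" layout and the sample IDs below are assumptions for illustration only; the actual ADEC syntax is not spelled out in this message.

```python
from collections import defaultdict


def split_namespace(dataset_id):
    """Split an ID into (namespace, local part) at the first '/'.

    The 'namespace/rest' layout is an assumption for illustration only.
    """
    namespace, sep, local = dataset_id.partition("/")
    if not sep or not namespace or not local:
        raise ValueError(f"no namespace in {dataset_id!r}")
    return namespace.lower(), local


def group_by_namespace(dataset_ids):
    """Group IDs by namespace so each authority's IDs can be handled separately."""
    groups = defaultdict(list)
    for dataset_id in dataset_ids:
        namespace, local = split_namespace(dataset_id)
        groups[namespace].append(local)
    return dict(groups)


# Hypothetical IDs; neither style is taken from a real archive.
example_ids = ["ads/KPNO.MAYALL#0001", "noao/kp4m/2004A-0001", "noao/ct4m/2004A-0002"]
grouped = group_by_namespace(example_ids)
```

Because only the text before the first "/" is interpreted, each namespace owner stays free to structure the rest of the string however suits its own datasets.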

A very interesting list. Might I suggest that this list be itself
scrubbed and extended as part of this process? There is a lot of
confusion about the organizations contained on the list. For instance,
here are the overtly NOAO related entries:

KPNO.12m Kitt Peak National Observatory/12 meter Telescope
KPNO.2.1m Kitt Peak National Observatory/2.1 meter Telescope
KPNO.BT Kitt Peak National Observatory/Bok Telescope
KPNO.MAYALL Kitt Peak National Observatory/Mayall Telescope
KPNO.MDMHT Kitt Peak National Observatory/MDM Hitner Telescope
KPNO.MDMMH Kitt Peak National Observatory/MDM HcGraw-Hill Telescope
KPNO.MPT Kitt Peak National Observatory/McMath-Pierce Telescope
KPNO.SARA Kitt Peak National Observatory/Southeastern Association
for Reasearch in Astronomy Telescope
KPNO.SWT Kitt Peak National Observatory/Space Watch Telescope
KPNO.WIYN Kitt Peak National Observatory/WYIN,
Wisconson-Indiana-Yale-NOAO Telescope

CTIO.1.5m Cerro Tololo Inter-American Observatory/1.5 meter Telescope
CTIO.2MASS Cerro Tololo Inter-American Observatory/2MASS Telescope
CTIO.VBT Cerro Tololo Inter-American Observatory/Victor Blanco
Telescope
CTIO.YALO Cerro Tololo Inter-American Observatory/YALO,
Yale-AURA-Lisbon-OU Telescope

The syntax that was suggested was observatoryLocation.telescope
as the way of identifying datasets in a way that will be most
straightforward for users. This list was suggested by
someone at ApJ as I recall. There has been some discussion
about how and if these should be tied to organizations.
One concern with organizational ties is that these ID's are
intended to be permanent. So 50 years from now it may be irrelevant to
users that a particular telescope was for a time run by a given
organization, and it's certainly possible that control of a telescope
(and its data) will shift from one organization
to another over the course of its lifetime. In the NASA world, that's actually
quite normal.

First, note that the "National Optical Astronomy Observatory" is not
mentioned yet NOAO is likely the legal owner of many data products
resulting from some of these facilities.

Second, note:

1) that data from KPNO.12m is owned (I would think) by *NRAO* (as is
the telescope),
2) that data from KPNO.BT and KPNO.SWT is owned by the University
of Arizona (or perhaps the state of Arizona),
3) that data from KPNO.MPT is owned by the National Solar Observatory,
4) that data from KPNO.MDMHT and KPNO.MDMMH is owned by whoever owned
MDM during the epoch of the observations in question,
5) that data from KPNO.SARA is owned by the SARA consortium,
6) that data from KPNO.WIYN is owned by the WIYN consortium, one
member of which is NOAO,
7) that there are two 2MASS telescopes and only one is at CTIO
8) that CTIO.YALO was run by the - you guessed it - YALO consortium
and has since ceased operations

Right, our thought is that organizations will register as
responsible for particular dataset holdings. So, e.g., the YALO consortium would
have registered as responsible for that holding and when it ceased
operations whoever has inherited responsibility for the holding (if anyone)
could register as the responsible party. Thus the granularity of the
dataset holdings needs to be small enough that a single party is likely to be
responsible for each.
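
The registration idea can be sketched roughly as follows, in Python. The class, its method names and the "successor archive" party are all hypothetical; nothing here describes an existing registry service.

```python
class HoldingRegistry:
    """Map each dataset holding to the single party currently responsible."""

    def __init__(self):
        self._responsible = {}

    def register(self, holding, party):
        """Record the first responsible party for a holding."""
        if holding in self._responsible:
            raise ValueError(f"{holding} is already registered")
        self._responsible[holding] = party

    def transfer(self, holding, new_party):
        """Hand responsibility to whoever inherits the holding."""
        if holding not in self._responsible:
            raise KeyError(holding)
        self._responsible[holding] = new_party

    def responsible_for(self, holding):
        """Return the responsible party, or None if nobody has registered."""
        return self._responsible.get(holding)


registry = HoldingRegistry()
registry.register("CTIO.YALO", "YALO consortium")
# Hypothetical successor; the thread does not say who inherited the data.
registry.transfer("CTIO.YALO", "successor archive")
```

The holding name stays fixed while the responsible party changes, which is the point of keeping the ID itself free of organizational ties.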

It is quite likely that I got some of those nuances wrong myself :-)

There appears to be a confusion between a ground-based observing site
and an observatory - perhaps this is a result of the list being compiled
by our friends in the space-based astronomical community?

No... As I mentioned above we didn't do this. If we had we surely wouldn't
have lumped all space observatories together! It may be that rather
than KPNO and CTIO they should be KP and CT. That certainly seems
reasonable to me. I don't think this list is set in concrete
or even particularly old jello.

In general an observatory is a political entity, a telescope is a facility,
and a site like Kitt Peak is a piece of real estate that may host
multiple facilities from multiple observatories. Depending on the details
of contracts or other binding operating agreements, an observatory may
"own" the data that result from a particular facility like a telescope,
instrument, archive or pipeline - or that ownership may devolve to a
specific member of some consortium. In many cases, one imagines that
a funding agency or government or perhaps even the "people of the United
States of America" may ultimately own a particular data product.

So, an example. NOAO operates twin 8Kx8K mosaic wide field imagers
at its sites on Kitt Peak in Arizona and on Cerro Tololo in Chile.
Depending on the phase of the moon (quite literally :-) the resulting
data may be owned by NOAO or by some instrumentalities associated with
the University of Wisconsin, Indiana University, Yale University and
in the near future perhaps the University of Maryland. Confounded with
this question of ownership is the issue of proprietary rights. Time
on NOAO facilities is awarded competitively and the successful PIs are
rewarded with sole access for some period (typically 18 months).

All of these issues are certainly complex, but in some sense they
are irrelevant. Either the organizations can work out some
agreements about how data are named that can be put into
a dataset id, or they can't and it won't happen. I don't
think we need to solve every problem to have a useful
capability.

A dataset ID can be a relatively simple beast - perhaps as simple as
a data source ID and a serial number. But the full taxonomy of dataset
provenance has to support many degrees of freedom. At the very least:

Nation
Funding agency
Observatory
Consortium member ("partner")
Telescope
Instrument
Date&Time
Proposal ID
PI and/or project ID
...

Here I think you are confusing the metadata describing an observation
with the 'name' of an observation. Why should an ID have the time?
One might choose to use the time in the ID. But there is no reason
why it has to be done that way. Why does it need a proposal ID, nation, agency?
Again you can choose to put them there, but I see no requirement why the
general ID specification needs to include this. We are not trying
to use the ID as a way of encapsulating the description of the
dataset, just a way to point to it.

The more I listen to myself talk, the more I convince (myself, anyway :-)
that a single DS_IDENT keyword is a very poor match to the underlying
requirements. Not only might a single file belong to multiple datasets
certified by a particular entity (like ADS), but they may belong to
multiple other datasets certified by multiple other entities - and more
to the point, the design of the certification process will vary from one
to the next to the next.

In particular, the NOAO Science Archive has been discussing the precise
questions of ownership and proprietary access and had already selected
a subset of fields along the lines of Observatory (NOAO, WIYN, SOAR, etc.),
Partner (NOAO, Wisconsin, Indiana, Yale, Brazil, etc.), Telescope (kp4m,
ct4m, wiyn, soar, etc.), Instrument (too many to list), Date&Time, and
(most similar to the ADS scheme) the NOAO Proposal ID spanning all these
facilities. Whatever we settle on will never fit within the confines of
any single keyword. On the other hand, I'd love to *also* include an
ADSID tag to even further constrain the provenance.

Agreeing on metadata fields is great, but I think it's
largely orthogonal to the question of whether we want a dataset id somewhere
as indeed your last comment suggests.

#17 March 24th 04, 05:22 PM

Tom McGlynn writes:

I think there are two different things that are getting confused
in this E-mail discussion.

That's precisely my point. Perhaps you can first clarify whether you
and Arnold are talking about the same requirements and resulting proposal.
If I understood the discussion of ADS identifiers, these supply a very
rich namespace with "multi-mission" support. More to the point, the ADS
identifiers benefit from network externality - the more headers contain
them from more projects at more institutions, the greater the value
of the identifiers to the community as a whole.

However, I do not think that this is the appropriate forum for
discussion of a particular syntax for the value of this keyword.

It may well be that all astronomical semantic discussions should now
happen under the happy VO umbrella. Personally, I think FITS has too
often skirted the difficult issues. If we are to debate reserving
DSIDnnnn for something called "dataset identifiers", isn't it
appropriate to address what that means? If not, why do we care if
an obscure set of keyword names are reserved at all?

My own read on this part of the discussion is that most people would
want to see the ID repeated in all relevant HDU's

Yes.

and that there probably needs to be at least an option for the id to
be a vector value. The latter requirement mandates a shorter keyword
(perhaps just DSID).

If by vector, you mean repeated keywords from the same or different ID
families, I agree. IDs are long strings. Won't fit many in 80 chars.
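
To put a number on that: a FITS card is 80 characters, of which 8 hold the keyword name, 2 hold the "= " value indicator and 2 are the quotes around a string, leaving 68 characters for the ID itself. A rough Python sketch of the arithmetic follows; the DSIDnnn keyword name and the sample IDs are placeholders, not anything agreed here.

```python
CARD_LENGTH = 80      # fixed length of one FITS header card
KEYWORD_FIELD = 8     # columns 1-8 hold the keyword name
VALUE_INDICATOR = 2   # columns 9-10 hold "= "
QUOTES = 2            # a string value is enclosed in single quotes

MAX_STRING_LENGTH = CARD_LENGTH - KEYWORD_FIELD - VALUE_INDICATOR - QUOTES


def make_card(keyword, value):
    """Format one keyword/string pair as an 80-character card, no comment field."""
    escaped = value.replace("'", "''")   # a quote inside a string is doubled
    if len(keyword) > KEYWORD_FIELD or len(escaped) > MAX_STRING_LENGTH:
        raise ValueError("keyword or value too long for a single card")
    return f"{keyword:<8}= '{escaped}'".ljust(CARD_LENGTH)


def cards_needed(dataset_ids):
    """Emit one card per ID under a hypothetical DSIDnnn keyword family."""
    return [make_card(f"DSID{n:03d}", ds_id) for n, ds_id in enumerate(dataset_ids, 1)]


# Placeholder IDs, not real dataset identifiers.
cards = cards_needed(["ADS/example-one", "NOAO/example-two"])
```

With IDs of a few dozen characters each, two at most would share a card, so repeated keywords are the practical way to carry a list.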

However, I do not think that this is the appropriate forum for
discussion of a particular syntax for the value of this keyword.

I think you must mean semantics, not syntax. We can't very well express
an opinion on the contents of a keyword whose legal values aren't discussed.
I thought Arnold did a fine job of starting to lay down the ground rules.
There isn't much point for the mechanism if all the proposal states is
"any string value".

While the ADEC hopes that our IDs will be useful and that others will
adopt them, we have no power to force such a change -- though the
astronomy journals may have a bit broader influence.

The FITS standards process is precisely the way to encourage conforming
usage. Arnold's message described a mechanism that sounded very
useful. NOAO is actively (very actively) pursuing a rich archive
facility for our large variety of high value astronomical data. Any
mechanism that can leverage the value of our data will be gratefully
adopted. If there are multiple astronomical naming conventions, we
may well support more than one. Why shouldn't these separate IDs
with separate semantics resulting from the separate constraints of
separate requirements be hosted in separate keywords?

So if, for example, NOAO were to adopt a different syntax and style
for the dataset IDs, for good and sufficient reasons of their own, then
they could use the same keyword or keywords and go ahead on their own.

There is an assumption here that the simple keyword being proposed will
successfully map onto an entirely different ID model. I suspect the
NOAO IDs (that do need to include the details that Tom seems to find
unpersuasive) will require several keywords. We'll likely populate a
large number of NSAxxxxx keywords with all sorts of info. Not all
FITS keyword usage has to be explicitly covered under the standard
(although a community wide keyword dictionary would be gratefully
received).

It would be desirable in this case if it was possible to distinguish
the different syntaxes used. Regardless I think it would be better
to have a standard place to look for the IDs than for software to
have to look for a list of keywords and see if there was ADSID or
ADECID or NOAOID or NRAOID or CDSID or ...

So, you're basically suggesting that software loop over all DSIDnnnn
to locate all the dataset identifiers. This may be a useful feature.
On the other hand, I haven't come up with a reason that I would need
to look for an identifier whose namespace I wasn't already interested
in. A simple keyword query will return ADSID (posited to be of general
community wide interest) and a second query will return NOAOID (of
specific interest only to NOAO staff and users). My software can
then generate a report or whatever that ties the two together. Give
me a use case for needing to retrieve a long list of opaque identifiers
related to projects completely outside my bailiwick.

Meanwhile, the DSIDnnnn scheme will require a potentially very expensive
traversal of several keywords in every header being considered. Imagine
mapping your header keywords to DB schema. Isn't your DB simply going
to contain a column (perhaps a vector) named ADS_ID? Our DB will certainly
contain a column named something like NOAO_ID. What is the value in
piling up a bunch of unrelated information under the same keyword
heading?
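
A small sketch of that mapping, using Python's sqlite3 module. The table layout, file names and ID strings are invented for illustration; only the ADSID and NOAOID keyword names come from this thread.

```python
import sqlite3

# Hypothetical header values, one dict per FITS file.
headers = [
    {"FILENAME": "a.fits", "ADSID": "ads-id-1", "NOAOID": "noao-id-1"},
    {"FILENAME": "b.fits", "ADSID": "ads-id-2"},
]

conn = sqlite3.connect(":memory:")
# One column per keyword: the column name itself says which kind of ID it holds.
conn.execute(
    "CREATE TABLE image (filename TEXT PRIMARY KEY, ads_id TEXT, noao_id TEXT)"
)
for hdr in headers:
    conn.execute(
        "INSERT INTO image VALUES (?, ?, ?)",
        (hdr["FILENAME"], hdr.get("ADSID"), hdr.get("NOAOID")),
    )

# Looking up a file by its ADS ID is a direct query on a named column.
row = conn.execute(
    "SELECT filename FROM image WHERE ads_id = ?", ("ads-id-2",)
).fetchone()
```

A missing keyword simply becomes a NULL in its own column, so nothing has to be parsed out of a shared field.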

The whole notion of keyword=value pairs is that the keyword identity
supplies some of the information. When I want the date of an
observation, I query DATE-OBS, not a list of DATEnnnn keywords,
searching for a string matching "OBS/20040324T170325Z". In effect,
the DSIDnnnn scheme asserts that our users won't find any direct use
for the dataset identifiers, otherwise we wouldn't make it so hard for
them to get at them. Instead of:

cl> hselect *.fits adsid yes
"this is an ADS ID string"

they would have to do something like:

cl> for (i=1; i<=3; i+=1) {
        hselect ("test*.fits", "dsid000"//i, yes) | match ADS
    }
"this is an ADS ID string"

I'm sure you can see the usage issues immediately. Here are just two.
What about DSIDnnnn values that contain a substring matching one of the
supported naming authorities (or one that is added in the future)?
How is the search truncated when you don't know that there are exactly
three keywords to start? Sure, a programmer can work around each of
these - but we add keywords for the benefit of our unsophisticated
users, too :-)
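
For what it's worth, both issues can be worked around, at the cost of more code than a single keyword query. A hedged sketch in Python, assuming the header is already available as a dict and that each value begins with "AUTHORITY/" (an assumption, not something agreed in this thread):

```python
import re

DSID_KEY = re.compile(r"^DSID\d{4}$")   # hypothetical DSIDnnnn keyword family


def ids_for_authority(header, authority):
    """Return values of all DSIDnnnn keywords whose value starts with 'authority/'.

    Anchoring the match at the start avoids hits on values that merely
    contain the authority name; scanning every key avoids guessing the count.
    """
    prefix = authority + "/"
    return [
        value
        for key, value in sorted(header.items())
        if DSID_KEY.match(key) and value.startswith(prefix)
    ]


# Hypothetical header: the second value contains "ADS" but belongs to NOAO.
header = {
    "DSID0001": "ADS/some-dataset",
    "DSID0002": "NOAO/copy-of-ADS-field",
    "DSID0007": "ADS/another-dataset",   # numbering need not be contiguous
    "OBJECT": "M31",
}
ads_ids = ids_for_authority(header, "ADS")
```

A plain substring match on "ADS" would also have returned the NOAO value, and a loop counting from 1 to 3 would have missed DSID0007.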

This also begs the question of identifiers for individual FITS HDUs.
A particular FITS file or HDU may belong to multiple datasets. A
particular HDU has a single identity, however. Shouldn't part of
this discussion include how to supply a community wide identifier
for each separate FITS object? Imagine starting with a dataset ID.
Doesn't that set ID have to coexist with some mechanism for referencing
all of its many members?

Here I think you are confusing the metadata describing an observation
with the 'name' of an observation.

Ah! To return to Lucio's contribution:

My personal tendency (but I'm an end user and not an archive maintainer
in this context) would have been to put part of the information in
directory names and not in file names (e.g. for my own BeppoSAX
analysis I used to store files as [A']/[B']/[C]/[D].type,

A familiar issue is how to tie an archive's data stores together with
its metadata DB. NOAO is specifically considering precisely this
directory tree structure for our raw data store and also how to tie
it into the resulting headers/DB.

Why should an ID have the time?

Astronomers have too often relied on convoluted filenames to convey the
placement of a specific data file within some multidimensional parameter
space. Time is key to groundbased observations because access to our
telescopes (and the resulting proprietary ties that bind) is distributed
via the calendar and clock.

One might choose to use the time in the ID. But there is no reason
why it has to be done that way.

This is precisely why different dataset IDs might require very different
FITS support. IDs generated for tying publications to data, for instance,
are likely going to be very different than IDs generated for tying data
objects to telescopes or archives.

Why does it need a proposal ID, nation, agency?

Our need for a dataset identifier is precisely to implement the
proprietary policies of our current organization. I am very supportive
of taking the very long term view of provenance. Over the very long
term, perhaps the fact that an entity known as NOAO used to own the
data may no longer matter. Perhaps a particular national observatory
will no longer exist because a particular nation will no longer exist :-)
(In his salad days, a college professor of mine set up the Iranian
National Observatory before the Shah fell...) In the long term we're
all dead :-)

However, NOAO's current need is precisely to consider who owns what and
who may have access to data and precisely when. Your mileage may vary -
which is what says to me that an ADS ID scheme should be placed in an
ADS branded keyword. It isn't that I have an issue with ADS dataset
IDs - far from it. I have an issue with a single style of dataset ID
coopting the entire notion of placing data within sets.

Rob

#18 March 24th 04, 05:22 PM

Tom McGlynn writes:

I think there are two different things that are getting confused
in this E-mail discussion.

That's precisely my point. Perhaps you can first clarify whether you
and Arnold are talking about the same requirements and resulting proposal.
If I understood the discussion of ADS identifiers, these supply a very
rich namespace with "multi-mission" support. More to the point, the ADS
identifiers benefit from network externality - the more headers contain
them from more projects at more institutions, the greater the value
of the identifiers to the community as a whole.

However, I do not think that this is the appropriate forum for
discussion of a particular syntax for the value of this keyword.

It may well be that all astronomical semantic discussions should now
happen under the happy VO umbrella. Personally, I think FITS has too
often skirted the difficult issues. If we are to debate reserving
DSIDnnnn for something called "dataset identifiers", isn't it
appropriate to address what that means? If not, why do we care if
an obscure set of keyword names are reserved at all?

My own read on this part of the discussion is that most people would
want to see the ID repeated in all relevant HDU's

Yes.

and that there probably needs to be at least an option for the id to
be a vector value. The latter requirement mandates a shorter keyword
(perhaps just DSID).

If by vector, you mean repeated keywords from the same or different ID
families, I agree. IDs are long strings. Won't fit many in 80 chars.

However, I do not think that this is the appropriate forum for
discussion of a particular syntax for the value of this keyword.

I think you must mean semantics, not syntax. We can't very well express
an opinion on the contents of a keyword whose legal values aren't discussed.
I thought Arnold did a fine job of starting to lay down the ground rules.
There isn't much point for the mechanism if all the proposal states is
"any string value".

While the ADEC hopes that our IDs will be useful and that others will
adopt them, we have no power to force such a change -- though the
astronomy journals may have a bit broader influence.

The FITS standards process is precisely the way to encourage conforming
usage. Arnold's message described a mechanism that sounded very
useful. NOAO is actively (very actively) pursuing a rich archive
facility for our large variety of high value astronomical data. Any
mechanism that can leverage the value of our data will be gratefully
adopted. If there are multiple astronomical naming conventions, we
may well support more than one. Why shouldn't these separate IDs
with separate semantics resulting from the separate constraints of
separate requirements be hosted in separate keywords?

So if, for example, NOAO were to adopt a different syntax and style
for the dataset IDs, for good and sufficient reasons of their own, then
they could use the same keyword or keywords and go ahead on their own.

There is an assumption here that the simple keyword being proposed will
successfully map onto an entirely different ID model. I suspect the
NOAO IDs (that do need to include the details that Tom seems to find
unpersuasive) will require several keywords. We'll likely populate a
large number of NSAxxxxx keywords with all sorts of info. Not all
FITS keyword usage has to be explicitly covered under the standard
(although a community wide keyword dictionary would be gratefully
received).

It would be desirable in this case if it was possible to distinguish
the different syntaxes used. Regardless I think it would be better
to have a standard place to look for the IDs than for software to
have to look for a list of keywords and see if there was ADSID or
ADECID or NOAOID or NRAOID or CDSID or ...

So, you're basically suggesting that software loop over all DSIDnnnn
to locate all the dataset identifiers. This may be a useful feature.
On the other hand, I haven't come up with a reason that I would need
to look for an identifier whose namespace I wasn't already interested
in. A simple keyword query will return ADSID (posited to be of general
community wide interest) and a second query will return NOAOID (of
specific interest only to NOAO staff and users). My software can
then generate a report or whatever that ties the two together. Give
me a use case for needing to retrieve a long list of opaque identifiers
related to projects completely outside my bailiwick.

Meanwhile, the DSIDnnnn scheme will require a potentially very expensive
traversal of several keywords in every header being considered. Imagine
mapping your header keywords to DB schema. Isn't your DB simply going
to contain a column (perhaps a vector) named ADS_ID? Our DB will certainly
contain a column named something like NOAO_ID. What is the value in
piling up a bunch of unrelated information under the same keyword
heading?
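
A minimal sketch of the schema mapping described above, using Python's built-in sqlite3 module. The table name, the column names, the ADSID/NOAOID keyword names and the ID values are illustrative assumptions, not an actual ADS or NOAO schema.

```python
import sqlite3

# In-memory database with one column per naming authority (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hdu (filename TEXT, ads_id TEXT, noao_id TEXT)")

# Headers reduced to plain dicts; the ID values are made up for illustration.
headers = {
    "a.fits": {"ADSID": "ADS/example-0001", "NOAOID": "NOAO/example-42"},
    "b.fits": {"NOAOID": "NOAO/example-43"},
}

for name, hdr in headers.items():
    # A missing keyword simply becomes NULL in that authority's column.
    conn.execute(
        "INSERT INTO hdu VALUES (?, ?, ?)",
        (name, hdr.get("ADSID"), hdr.get("NOAOID")),
    )

# One direct query per authority, with no scan over unrelated identifiers.
rows = conn.execute(
    "SELECT filename FROM hdu WHERE ads_id IS NOT NULL"
).fetchall()
print(rows)
```

With authority-specific keywords, each header keyword maps to exactly one column, which is the point being argued here.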

The whole notion of keyword=value pairs is that the keyword identity
supplies some of the information. When I want the date of an
observation, I query DATE-OBS, not a list of DATEnnnn keywords,
searching for a string matching "OBS/20040324T170325Z". In effect,
the DSIDnnnn scheme asserts that our users won't find any direct use
for the dataset identifiers, otherwise we wouldn't make it so hard for
them to get at them. Instead of:

cl> hselect *.fits adsid yes
"this is an ADS ID string"

they would have to do something like:

cl> for (i=1; i<=3; i+=1) {
hselect ("test*.fits", "dsid000"//i, yes) | match ADS
}
"this is an ADS ID string"

I'm sure you can see the usage issues immediately. Here are just two.
What about DSIDnnnn values that contain a substring matching one of the
supported naming authorities (or one that is added in the future)?
How is the search truncated when you don't know that there are exactly
three keywords to start? Sure, a programmer can work around each of
these - but we add keywords for the benefit of our unsophisticated
users, too :-)
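
For illustration, here is a rough Python sketch of what a programmer's workaround for these two issues might look like. It assumes, purely for the example, that a DSIDnnnn value begins with the naming authority followed by "/"; that value format, the helper name dataset_ids and the header contents are assumptions, not part of any proposal.

```python
import re

# Hypothetical header reduced to a dict; the ID values are made up.
header = {
    "SIMPLE": True,
    "DSID0001": "NOAO/example-42 (cites ADS paper)",
    "DSID0002": "ADS/example-0001",
    "OBJECT": "M31",
}

# Matches DSID followed by exactly four digits, so no keyword count is needed.
DSID_KEY = re.compile(r"^DSID\d{4}$")


def dataset_ids(header, authority):
    """Return the values of all DSIDnnnn keywords issued by `authority`.

    Assumes a value starts with the authority name followed by "/".
    """
    prefix = authority + "/"
    found = []
    for key in sorted(header):
        value = header[key]
        # Anchor the match at the start of the value instead of searching
        # for a substring, so "ADS" inside another ID is not a false hit.
        if DSID_KEY.match(key) and isinstance(value, str) and value.startswith(prefix):
            found.append(value)
    return found


print(dataset_ids(header, "ADS"))
```

A plain substring match on "ADS" would also have returned the DSID0001 value; the anchored prefix test does not, but only because of the assumed value format.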

This also begs the question of identifiers for individual FITS HDUs.
A particular FITS file or HDU may belong to multiple datasets. A
particular HDU has a single identity, however. Shouldn't part of
this discussion include how to supply a community wide identifier
for each separate FITS object? Imagine starting with a dataset ID.
Doesn't that set ID have to coexist with some mechanism for referencing
all of its many members?

Here I think you are confusing the metadata describing an observation
with the 'name' of an observation.

Ah! To return to Lucio's contribution:

My personal tendency (but I'm an end user and not an archive maintainer
in this context) would have been to put part of the information in
directory names and not in file names (e.g. for my own BeppoSAX
analysis I used to store files as [A']/[B']/[C]/[D].type,

A familiar issue is how to tie an archive's data stores together with
its metadata DB. NOAO is specifically considering precisely this
directory tree structure for our raw data store and also how to tie
it into the resulting headers/DB.

Why should an ID have the time?

Astronomers have too often relied on convoluted filenames to convey the
placement of a specific data file within some multidimensional parameter
space. Time is key to groundbased observations because access to our
telescopes (and the resulting proprietary ties that bind) is distributed
via the calendar and clock.

One might choose to use the time in the ID. But there is no reason
why it has to be done that way.

This is precisely why different dataset IDs might require very different
FITS support. IDs generated for tying publications to data, for instance,
are likely going to be very different than IDs generated for tying data
objects to telescopes or archives.

Why does it need a proposal ID, nation, agency?

Our need for a dataset identifier is precisely to implement the
proprietary policies of our current organization. I am very supportive
of taking the very long term view of provenance. Over the very long
term, perhaps the fact that an entity known as NOAO used to own the
data may no longer matter. Perhaps a particular national observatory
will no longer exist because a particular nation will no longer exist :-)
(In his salad days, a college professor of mine set up the Iranian
National Observatory before the Shah fell...) In the long term we're
all dead :-)

However, NOAO's current need is precisely to consider who owns what and
who may have access to data and precisely when. Your mileage may vary -
which is what says to me that an ADS ID scheme should be placed in an
ADS branded keyword. It isn't that I have an issue with ADS dataset
IDs - far from it. I have an issue with a single style of dataset ID
coopting the entire notion of placing data within sets.

Rob

#19 March 24th 04, 07:41 PM

Rob Seaman wrote:

Tom McGlynn writes:

I think there are two different things that are getting confused
in this E-mail discussion.

That's precisely my point.
....
.. If we are to debate reserving
DSIDnnnn for something called "dataset identifiers", isn't it
appropriate to address what that means? If not, why do we care if
an obscure set of keyword names are reserved at all?

Maybe you don't... The FITS standard doesn't discuss
what observer or origin means other than in the broadest
terms. In the context of this newsgroup I don't
think it is possible to get agreement beyond that. As far
as whether it is possible to have useful discussion without
including the syntax, I'd suggest that understanding where
the ID goes and whether it is a scalar or vector value are
important issues where there has been substantial discussion.

The FITS standards process is precisely the way to encourage conforming
usage. Arnold's message described a mechanism that sounded very
useful. NOAO is actively (very actively) pursuing a rich archive
facility for our large variety of high value astronomical data. Any
mechanism that can leverage the value of our data will be gratefully
adopted. If there are multiple astronomical naming conventions, we
may well support more than one. Why shouldn't these separate IDs
with separate semantics resulting from the separate constraints of
separate requirements be hosted in separate keywords?

So if, for example, NOAO were to adopt a different syntax and style
for the dataset IDs, for good and sufficient reasons of their own, then
they could use the same keyword or keywords and go ahead on their own.

There is an assumption here that the simple keyword being proposed will
successfully map onto an entirely different ID model. I suspect the
NOAO IDs (that do need to include the details that Tom seems to find
unpersuasive) will require several keywords. We'll likely populate a
large number of NSAxxxxx keywords with all sorts of info. Not all
FITS keyword usage has to be explicitly covered under the standard
(although a community wide keyword dictionary would be gratefully
received).

It would be desirable in this case if it was possible to distinguish
the different syntaxes used. Regardless I think it would be better
to have a standard place to look for the IDs than for software to
have to look for a list of keywords and see if there was ADSID or
ADECID or NOAOID or NRAOID or CDSID or ...

So, you're basically suggesting that software loop over all DSIDnnnn
to locate all the dataset identifiers. This may be a useful feature.
On the other hand, I haven't come up with a reason that I would need
to look for an identifier whose namespace I wasn't already interested
in.

You might not. But suppose someone builds a general service
that transforms IDs (from any origin) into pointers. Users might then build
clients of this service that use the links that are returned. However
if they don't know where the ID information is stored in the FITS header
they have to pass the entire headers of each extension in the file for the
remote service to parse. Going the other direction, when users ingest a
set of FITS files from heterogeneous sources, they may well want to extract
the dataset id as they ingest the files. They don't want to have to
update software to check for new keywords every time a new authority comes online.
It would be a lot easier if there is a nominal location for the dataset ID.

A simple keyword query will return ADSID (posited to be of general
community wide interest) and a second query will return NOAOID (of
specific interest only to NOAO staff and users). My software can
then generate a report or whatever that ties the two together. Give
me a use case for needing to retrieve a long list of opaque identifiers
related to projects completely outside my bailiwick.

Meanwhile, the DSIDnnnn scheme will require a potentially very expensive
traversal of several keywords in every header being considered. Imagine
mapping your header keywords to DB schema. Isn't your DB simply going
to contain a column (perhaps a vector) named ADS_ID? Our DB will certainly
contain a column named something like NOAO_ID. What is the value in
piling up a bunch of unrelated information under the same keyword
heading?

While I'm sure there might be exceptions, I'd hope that generally there would
be a single set of IDs maintained by a single institution. Having multiple
sites responsible for independent sets of IDs might occasionally be
necessary, but I don't think that's what we want to encourage.

We would certainly want the NOAO to be maintaining the IDs for the datasets
in its domain. But the NOAO should not be constrained regarding the format of the IDs...
Which is why I don't wish to put any significant constraint on the syntax of the ID.

The whole notion of keyword=value pairs is that the keyword identity
supplies some of the information. When I want the date of an
observation, I query DATE-OBS, not a list of DATEnnnn keywords,
searching for a string matching "OBS/20040324T170325Z". In effect,
the DSIDnnnn scheme asserts that our users won't find any direct use
for the dataset identifiers, otherwise we wouldn't make it so hard for
them to get at them. Instead of:

cl> hselect *.fits adsid yes
"this is an ADS ID string"

they would have to do something like:

cl> for (i=1; i<=3; i+=1) {
hselect ("test*.fits", "dsid000"//i, yes) | match ADS
}
"this is an ADS ID string"

Not at all... Forgive my lack of knowledge of IRAF, but if you have
an ADSID and an NOAOID then we have to coordinate them anyway or the user
is going to have to write code like

if (thereIsAnADSID) then
    use the ADSID
else if (thereIsAnNOAOID) then
    use the NOAOID
else if (thereIsaCDSID) then
    use the CDSID
...
and every time we get a new ID authority we have to add another
test.

I much prefer
if (thereIsaDSID) then
    call theDSIDResolver()

This doesn't eliminate the switch statement above. It's
just moved into theDSIDResolver, but in a networked world
that's very likely now to be a Web service that many users
invoke, so the impact of a new kind of ID is much less and
most people's software accommodates it with no changes.
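
A rough Python sketch of that idea follows. It assumes, for illustration only, that an ID value starts with the authority name followed by "/"; the RESOLVERS table, the resolver.example URLs and the function names are hypothetical placeholders, not real services or an agreed interface.

```python
from urllib.parse import quote

# Hypothetical per-authority URL templates. This table is the "switch
# statement", kept in one place (ideally behind a shared service).
RESOLVERS = {
    "ADS": "https://resolver.example/ads?id={id}",
    "NOAO": "https://resolver.example/noao?id={id}",
}


def resolve(dataset_id):
    """Turn a dataset ID into a link, dispatching on the assumed
    authority prefix (the text before the first "/")."""
    authority, _, _ = dataset_id.partition("/")
    template = RESOLVERS.get(authority)
    if template is None:
        raise LookupError("no resolver registered for " + repr(authority))
    return template.format(id=quote(dataset_id, safe=""))


def links_for(header):
    """Client code: collect every DSID* value and hand it to the resolver.
    Nothing here changes when a new authority is added to RESOLVERS."""
    return [resolve(header[k]) for k in sorted(header) if k.startswith("DSID")]


print(links_for({"DSID0001": "ADS/example-0001", "OBJECT": "M31"}))
```

Adding a new authority means one new entry in the resolver table; the client function links_for stays untouched.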

E.g., I don't need to worry about handling resolution
of new object names as they are added to astronomical nomenclature.
I send NED and SIMBAD the strings and they do the resolution
for me. The same would occur with the ID resolvers. Short of
sending servers essentially the complete FITS headers, this
approach doesn't work if providers all use their own keywords
to store the IDs.

Of course if the NOAOID is used purely internally it is of no interest
to the discussion. I am assuming that the NOAOID is an ID of interest
to users other than NOAO itself. Everyone is always free to define
their own IDs for their internal usage.

I'm sure you can see the usage issues immediately. Here are just two.
What about DSIDnnnn values that contain a substring matching one of the
supported naming authorities (or one that is added in the future)?
How is the search truncated when you don't know that there are exactly
three keywords to start? Sure, a programmer can work around each of
these - but we add keywords for the benefit of our unsophisticated
users, too :-)

The same issue crops up if you use different keywords or encode something
in the value. The advantage of putting it in the value is that software
knows where in the header to find the information it needs to start with.

This also begs the question of identifiers for individual FITS HDUs.
A particular FITS file or HDU may belong to multiple datasets. A
particular HDU has a single identity, however. Shouldn't part of
this discussion include how to supply a community wide identifier
for each separate FITS object? Imagine starting with a dataset ID.
Doesn't that set ID have to coexist with some mechanism for referencing
all of its many members?

The issue of identity has been extensively discussed in the Virtual Observatory
community. The suggested ADEC convention is compatible with the outcome
of that discussion, but the outcome was essentially that this is not
something that can be solved generally. Resolving the IDs into links
to the entire dataset is certainly something that we want. The current
ADEC service does this. If you were to build NOAO IDs I certainly
hope that you would provide such a service.

Here I think you are confusing the metadata describing an observation
with the 'name' of an observation.

Ah! To return to Lucio's contribution:

My personal tendency (but I'm an end user and not an archive maintainer
in this context) would have been to put part of the information in
directory names and not in file names (e.g. for my own BeppoSAX
analysis I used to store files as [A']/[B']/[C]/[D].type,

A familiar issue is how to tie an archive's data stores together with
its metadata DB. NOAO is specifically considering precisely this
directory tree structure for our raw data store and also how to tie
it into the resulting headers/DB.

Why should an ID have the time?

Astronomers have too often relied on convoluted filenames to convey the
placement of a specific data file within some multidimensional parameter
space. Time is key to groundbased observations because access to our
telescopes (and the resulting proprietary ties that bind) is distributed
via the calendar and clock.

And I have no problem with including the time (or any string
a user chooses) in the dataset ID. I just don't see why the
proposal needs to mandate it, or even worry about that level
of detail.

It's been my experience though that when building tables, it's very convenient to
have a simple unique key -- even if it's completely arbitrary -- rather
than building it up by concatenating enough elements in the table to
make each entry unique. It sounds to me like that's what you are doing
here, but if it works for you that's fine with me. I'm not trying to
suggest you use any given approach.
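
To make the contrast concrete, here is a small sketch using Python's sqlite3 module. The table layout, the column names and the proposal number are invented for the example; only the KPNO.MAYALL facility name comes from the list quoted earlier in the thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table: `id` is the simple arbitrary key; the descriptive
# columns stay ordinary metadata, with a UNIQUE constraint standing in
# for the key built by concatenation.
conn.execute(
    """
    CREATE TABLE dataset (
        id        INTEGER PRIMARY KEY,
        telescope TEXT NOT NULL,
        proposal  TEXT NOT NULL,
        obs_time  TEXT NOT NULL,
        UNIQUE (telescope, proposal, obs_time)
    )
    """
)

cur = conn.execute(
    "INSERT INTO dataset (telescope, proposal, obs_time) VALUES (?, ?, ?)",
    ("KPNO.MAYALL", "2004A-0001", "2004-03-24T17:03:25Z"),  # made-up values
)
key = cur.lastrowid  # the arbitrary key assigned by the database

# Lookups and references need only the short key, not the three-part tuple.
row = conn.execute(
    "SELECT telescope, proposal FROM dataset WHERE id = ?", (key,)
).fetchone()
print(key, row)
```

Either approach keeps the time and proposal information available; the difference is only whether that information is packed into the identifier itself.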

This is precisely why different dataset IDs might require very different
FITS support. IDs generated for tying publications to data, for instance,
are likely going to be very different than IDs generated for tying data
objects to telescopes or archives.

Maybe, but if we allow multiple IDs for a given element I don't see
why that matters.

Why does it need a proposal ID, nation, agency?

Our need for a dataset identifier is precisely to implement the
proprietary policies of our current organization. I am very supportive
of taking the very long term view of provenance. Over the very long
term, perhaps the fact that an entity known as NOAO used to own the
data may no longer matter. Perhaps a particular national observatory
will no longer exist because a particular nation will no longer exist :-)
(In his salad days, a college professor of mine set up the Iranian
National Observatory before the Shah fell...) In the long term we're
all dead :-)

But we hope our data are not! Again this seems to be saying that
we need to cram things into the data id so that it serves as a mini-description
of the dataset. I'm not keen on that approach, but it seems
to be easy to accommodate within the very broad context I'm
suggesting is all we should try to agree on.

However, NOAO's current need is precisely to consider who owns what and
who may have access to data and precisely when. Your mileage may vary -
which is what says to me that an ADS ID scheme should be placed in an
ADS branded keyword. It isn't that I have an issue with ADS dataset
IDs - far from it. I have an issue with a single style of dataset ID
coopting the entire notion of placing data within sets.

I guess that's what confuses me most... All I am suggesting we
agree on (right now at least) is the keyword names. I'm explicitly not
advocating for the ADEC style -- though I think it can accommodate
much if not all of what you would like to do.

Regards,
Tom

#20 March 24th 04, 07:41 PM

Rob Seaman wrote:

Tom McGlynn writes:

I think there are two different things that are getting confused
in this E-mail discussion.

That's precisely my point.
....
.. If we are to debate reserving
DSIDnnnn for something called "dataset identifiers", isn't it
appropriate to address what that means? If not, why do we care if
an obscure set of keyword names are reserved at all?

Maybe you don't... The FITS standard doesn't discuss
what observer or origin means other than in the broadest
terms. In the context of this newsgroup I don't
think it is possible to get agreement beyond that. As far
as whether it is possible to have useful discussion without
including the syntax, I'd suggest that understanding where
the goes and whether it is a scalar or vector value are
important issues where there has been substantial discussion.

The FITS standards process is precisely the way to encourage conforming
usage. Arnold's message described a mechanism that sounded very
useful. NOAO is actively (very actively) pursuing a rich archive
facility for our large variety of high value astronomical data. Any
mechanism that can leverage the value of our data will be gratefully
adopted. If there are multiple astronomical naming conventions, we
may well support more than one. Why shouldn't these separate IDs
with separate semantics resulting from the separate constraints of
separate requirements be hosted in separate keywords?

So if, for example, NOAO were to adopt a different syntax and style
for the dataset IDs, for good and sufficient reasons of their own, then
they could use the same keyword or keywords and go ahead on their own.

There is an assumption here that the simple keyword being proposed will
successfully map onto an entirely different ID model. I suspect the
NOAO IDs (that do need to include the details that Tom seems to find
unpersuasive) will require several keywords. We'll likely populate a
large number of NSAxxxxx keywords with all sorts of info. Not all
FITS keyword usage has to be explicitly covered under the standard
(although a community wide keyword dictionary would be gratefully
received).

It would be desirable in this case if it was possible to distinguish
the different syntaxes used. Regardless I think it would be better
to have a standard place to look for the IDs than for software to
have to look for a list of keywords and see if there was ADSID or
ADECID or NOAOID or NRAOID or CDSID or ...

So, you're basically suggesting that software loop over all DSIDnnnn
to locate all the dataset identifiers. This may be a useful feature.
On the other hand, I haven't come up with a reason that I would need
to look for an identifier whose namespace I wasn't already interested
in.

You might not. But suppose someone builds a general service
that transforms IDs (from any origin) into pointers. Users might then build
clients of this service that uses the links that are returned. However
if they don't know where the ID information is stored in the FITS header
they have to pass the entire headers of each extension in the file for the
remote service to parse. Going the other direction, when users ingest a
set of FITS files from heterogenous sources, they may well want to extract
the dataset id as they ingest the files. They don't want to have to
update software to check for new keywords every time a new authority comes online.
It would be a lot easier if there is a nominal location for the dataset ID.

A simple keyword query will return ADSID (posited to be of general
community wide interest) and a second query will return NOAOID (of
specific interest only to NOAO staff and users). My software can
then generate a report or whatever that ties the two together. Give
me a use case for needing to retrieve a long list of opaque identifiers
related to projects completely outside my bailiwick.

Meanwhile, the DSIDnnnn scheme will require a potentially very expensive
traversal of several keywords in every header being considered. Imagine
mapping your header keywords to DB schema. Isn't your DB simply going
to contain a column (perhaps a vector) named ADS_ID? Our DB will certainly
contain a column named something like NOAO_ID. What is the value in
piling up a bunch of unrelated information under the same keyword
heading?

While I'm sure there might be exceptions, I'd hope that generally there would
be a single set of IDs maintained by a single institution. Having multiple
sites responsible for independent sets of IDs might occasionally be
necessary, but I don't think that's what we want to encourage.

We would certainly want the NOAO to be maintaining the IDs for the datasets
in its domain. But the NOAO should not be constrained regarding the format of the IDs...
Which is why I don't wish to put any significant constraint on the syntax of the ID.

The whole notion of keyword=value pairs is that the keyword identity
supplies some of the information. When I want the date of an
observation, I query DATE-OBS, not a list of DATEnnnn keywords,
searching for a string matching "OBS/20040324T170325Z". In effect,
the DSIDnnnn scheme asserts that our users won't find any direct use
for the dataset identifiers, otherwise we wouldn't make it so hard for
them to get at them. Instead of:

cl hselect *.fits adsid yes
"this is an ADS ID string"

they would have to do something like:

cl for (i=1; i=3; i+=1) {
hselect ("test*.fits", "dsid000"//i, yes) | match ADS
}
"this is an ADS ID string"

Not at all... Forgive my lack of knowledge of IRAF, but if you have
an ADSID and an NOAOID then we have to coordinate them anyway or the user
is going to have to write code like

if (thereIsAnADSID) then
use the ADSID
else if (thereIsAnNOAOID) then
use the NOAOID
else if (thereIsaCDSID) then
use the CDSID
...
and every time we get a new ID authority we have to add another
test.

I much prefer
if (thereIsaDSID) then
call theDSIDResolver()

This doesn't eliminate the switch statement above. It's
just moved into theDSIDResolver, but in networked world
that's very likely not to be a Web service that many users
invoke so the impact of a new kind of ID is much less and
most people's software accommodates it with no changes.

E.g., I don't need to worry about handling resolution
of new object names as they are added to astronomical nomenclature.
I send NED and SIMBAD the strings and they do the resoluiton
for me. The same would occur with the ID resolvers. Other
than sending servers essentially the complete FITS headers this
approach doesn't work if providers all use their own keywords
to store the ids.

Of course if the NOAOID is used purely internally it is of no interest
to the discussion. I am assuming that the NOAOID is an ID of interest
to users other than NOAO itself. Everyone is always free to define
their own IDs for their internal usage.

I'm sure you can see the usage issues immediately. Here are just two.
What about DSIDnnnn values that contain a substring matching one of the
supported naming authorities (or one that is added in the future)?
How is the search truncated when you don't know that there are exactly
three keywords to start? Sure, a programmer can work around each of
these - but we add keywords for the benefit of our unsophisticated
users, too :-)

The same issue crops up if you use different keywords or encode something
in the value. The advantage of putting it in the value, is that software
knows where in the header to find the information it needs to start with.

This also begs the question of identifiers for individual FITS HDUs.
A particular FITS file or HDU may belong to multiple datasets. A
particular HDU has a single identity, however. Shouldn't part of
this discussion include how to supply a community wide identifier
for each separate FITS object? Imagine starting with a dataset ID.
Doesn't that set ID have to coexist with some mechanism for referencing
all of its many members?

The issue of identity has been extensively discussed in the Virtual Observatory
community. The suggested ADEC convention is compatible with the outcome
of that discussion, but the outcome was essentially that this is not
something that can be solved generally. Resolving the IDs into links
to the entire dataset is certainly something that we want. The current
ADEC service does this. It you were to build NOAO ID's I certainly
hope that you would provide such a service.

Here I think you are confusing the metadata describing an observation
with the 'name' of an observation.

Ah! To return to Lucio's contribution:

My personal tendency (but I'm an end user and not an archive mantainer
in this context) would have been to put part of the information in
directory names and not in file names (e.g. for my own BeppoSAX
analysis I used to store files as [A']/[B']/[C]/[D].type,

A familiar issue is how to tie an archive's data stores together with
its metadata DB. NOAO is specifically considering precisely this
directory tree structure for our raw data store and also how to tie
it into the resulting headers/DB.

Why should an ID have the time?

Astronomers have too often relied on convoluted filenames to convey the
placement of a specific data file within some multidimensional parameter
space. Time is key to groundbased observations because access to our
telescopes (and the resulting proprietary ties that bind) is distributed
via the calendar and clock.

And I have no problem with including the time (or any string
a user chooses) in the dataset ID. I just don't see why the
proposal needs to mandate it, or even worry about that level
of detail.

It's been my experience though that when building tables, its very convenient to
have a simple unique key -- even if its completely arbitrary -- rather
than building it up by concatenating enough elements in the table to
make each entry unique. It sounds to me like that's what you are doing
here, but if it works for you that's fine with me. I'm not trying to
suggest you use any given approach.

This is precisely why different dataset IDs might require very different
FITS support. IDs generated for tying publications to data, for instance,
are likely going to be very different than IDs generated for tying data
objects to telescopes or archives.

Maybe, but if we allow multiple IDs for a given element I don't see
why that matters.
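
As a rough illustration of multiple IDs on one element, the sketch below gathers every identifier stored under a base keyword and its numbered variants. It assumes the header is available as a plain mapping of keyword to value, borrows the ADSID / ADSIDnnn names floated earlier in this thread purely as an example, and uses made-up ID strings; none of this is an agreed convention.

```python
import re


def collect_dataset_ids(header, base="ADSID"):
    """Return the values of `base` and `base` + up to three digits, in order.

    `header` is assumed to be a plain dict of keyword -> value.
    """
    pattern = re.compile(rf"^{re.escape(base)}(\d{{0,3}})$")
    found = []
    for key, value in header.items():
        match = pattern.match(key)
        if match:
            digits = match.group(1)
            # The bare keyword (no digits) sorts ahead of the numbered ones.
            index = int(digits) if digits else -1
            found.append((index, value))
    return [value for _, value in sorted(found)]


# Hypothetical header contents; the ID strings are placeholders.
example_header = {
    "SIMPLE": True,
    "OBJECT": "M31",
    "ADSID002": "placeholder-id-c",
    "ADSID": "placeholder-id-a",
    "ADSID001": "placeholder-id-b",
}
dataset_ids = collect_dataset_ids(example_header)
```

A reader written this way does not need to know how any individual ID was constructed, which is the point: IDs of quite different styles can sit side by side in the same header.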

Why does it need a proposal ID, nation, agency?

Our need for a dataset identifier is precisely to implement the
proprietary policies of our current organization. I am very supportive
of taking the very long term view of provenance. Over the very long
term, perhaps the fact that an entity known as NOAO used to own the
data may no longer matter. Perhaps a particular national observatory
will no longer exist because a particular nation will no longer exist :-)
(In his salad days, a college professor of mine set up the Iranian
National Observatory before the Shah fell...) In the long term we're
all dead :-)

But we hope our data are not! Again this seems to be saying that
we need to cram things into the data id so that it serves as a mini-description
of the dataset. I'm not keen on that approach, but it seems
to be easy to accommodate within the very broad context I'm
suggesting is all we should try to agree on.

However, NOAO's current need is precisely to consider who owns what and
who may have access to data and precisely when. Your mileage may vary -
which is what says to me that an ADS ID scheme should be placed in an
ADS branded keyword. It isn't that I have an issue with ADS dataset
IDs - far from it. I have an issue with a single style of dataset ID
coopting the entire notion of placing data within sets.

I guess that's what confuses me most... All I'm suggesting we
agree on (right now at least) is the keyword names. I'm explicitly not
advocating for the ADEC style -- though I think it can accommodate
much if not all of what you would like to do.

Regards,
Tom
