SpaceBanter.com - View Single Post - [fitsbits] 'Dataset Identifications' postings (digest)

#26 March 25th 04, 10:30 AM

Let me answer a bunch of messages in one go.

(By the way, if you've not already guessed, the messages tagged "LC's
Nospam ..." are by me as well; it depends on whether I use fitsbits or post
to the NG. In general I use the mailing list for longer messages.)

From: Rob Seaman
Date: Tue, 23 Mar 2004 21:54:40 +0000 (UTC)

You will find the current list at:
http://vo.ads.harvard.edu/dv/facilities.txt

A very interesting list. ....

Indeed.

There appears to be a confusion between a ground-based observing site
and an observatory - perhaps this is a result of the list being compiled
by our friends in the space-based astronomical community?

Maybe, that's why I found it quite natural for me ...
.... although I've some reservations just on the "satellite" subset, namely

is a Sa:spacecraft enough to define where the dataset is archived ?

Surely yes if there is a single archive managed by a single space
agency or its "contractor".

Possibly yes if the satellite is a cooperation between different
agencies, AND they have agreed to run the same pipeline AND to keep
mirror sites

May fail if different agencies, organizations or institutes decide to
run different pipelines on the same data ! Resulting in two separate
datasets stemming from the same raw data.

In general an observatory is a political entity, a telescope is a facility,
and a site like Kitt Peak is a piece of real estate that may host
multiple facilities from multiple observatories. Depending on the details
of contracts or other binding operating agreements, an observatory may
"own" the data that result from a particular facility like a telescope,

I guess it does not matter at all who owns the data rights for the period
during which the data are not public. If the data are to be indexed, it
means they are public ... either in some official archive or possibly in
some private one.

A dataset ID can be a relatively simple beast - perhaps as simple as
a data source ID and a serial number. But the full taxonomy of dataset
provenance has to support many degrees of freedom. At the very least:

Nation
Funding agency

Just for the sake of argument, a "funding agency" is not necessarily
associated with a single nation, at least this side of the Atlantic (ESA,
ESO) ... or of the Panama canal (ESO again :-) ).

Observatory
Consortium member ("partner")

The latter is hardly relevant to the identification of the dataset

Telescope
Instrument

these and the above are (loosely) covered by ORIGIN, TELESCOP, INSTRUME,
or other keywords which may be in the same FITS file, or (as said by
others already) in some database at the archive site

Date&Time
Proposal ID
PI and/or project ID

The latter two might be used inside the dataset identifier, or as pointers
to locate the data, internally by the archiving organization. But what is
"inside" is not our business. Similarly the date might be used in the
identifier, again none of our business.

I agree that usually an "observational" (i.e. not "multi-observation")
dataset may be linked to a single date, although the reverse is not
necessarily true. I mean I forgot one case in the examples in my previous
posting, i.e. the third below :

- ground based observatories typically observe one position of the sky
from one instrument at one telescope at a time

- space observatories often observe a position of the sky from SEVERAL
coaxial (although different FoV size) instruments/telescopes on the
satellite (and for me this is ONE dataset)

- however sometimes there are non-coaxial instruments. I take the case
of BeppoSAX, where during each OP (Observing Period) one had 2-3
different FOTs (datasets) : one for the NFIs (Narrow Field Instruments)
pointing along the Z axis, and one each for the two WFCs (Wide Field
Cameras) pointing along +Y and -Y (maybe just one was on). I guess
RXTE with the ASM has something similar.

-----------------------------------------------------------------------
From: Thomas McGlynn
Date: Wed, 24 Mar 2004 10:11:37 -0500

[...] any specific syntax used. E.g., in FITS today we have keywords
ORIGIN, TELESCOP, INSTRUME and OBSERVER where the general semantics of
the keyword is specified, but the format is completely undefined

Unfortunately some aspects of the semantics are also ill-defined (see
discussions held at different times). Maybe it would be better to specify
the usage a bit more precisely.

Although most details (including some I've raised) are out of scope
indeed.

We should for instance state that the keyword is a string, and that the
first substring from the beginning to the first slash defines a namespace,
while the rest of the content is defined by the authority managing such
namespace.
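
A minimal sketch of that rule, in Python. The function name split_dsid and
the sample value are purely illustrative assumptions, not part of any agreed
convention; everything after the first slash is deliberately left
uninterpreted, since it belongs to the namespace authority.

```python
def split_dsid(value):
    """Split an identifier string into (namespace, remainder).

    The namespace is everything before the first slash; the remainder is
    opaque and only meaningful to the authority managing that namespace.
    """
    namespace, slash, remainder = value.partition("/")
    if not slash or not namespace:
        raise ValueError("no namespace prefix in %r" % (value,))
    return namespace, remainder


# Hypothetical identifier, made up for illustration only.
print(split_dsid("ADS/some.facility#obs/12345"))
```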

We should also indicate the prospective usage, which is still not totally
clear to me (see below).

So I see the discussion about where such a keyword would go,

I.e. in primary header, in each extension header, in some extension header

whether we need a keyword that allows for multiple values
(which DS_IDENT would not) as the kind of things we could

Do you mean multiple occurrences of the same keyword (like HISTORY or
COMMENT) or breaking a single long string value in continuation keywords ?

to be at least an option for the id to be a vector value. The
latter requirement mandates a shorter keyword (perhaps just DSID).

See below on "vector"

However, I do not think that this is the appropriate forum
for discussion of a particular syntax for the value of this keyword.

Except for the above notion of namespace, and for a possibility to define
that it should be a string contained in a SINGLE keyword (that would limit
its length to 68 characters).

From: Rob Seaman
Date: Wed, 24 Mar 2004 17:22:31 +0000 (UTC)

It may well be that all astronomical semantic discussions should now
happen under the happy VO umbrella. Personally, I think FITS has too
often skirted the difficult issues. If we are to debate reserving
DSIDnnnn for something called "dataset identifiers", isn't it
appropriate to address what that means? If not, why do we care if
an obscure set of keyword names are reserved at all?

That would avoid the loose situation we have for ORIGIN etc.

My own read on this part of the discussion is that most people would
want to see the ID repeated in all relevant HDU's

Yes.

My personal inclination (as an extremist Ockhamist) is that keywords shall
not be multiplied praeter necessitatem. So I would tend to put one (set
of) keyword(s) in the primary header if they apply to all the file, and to
put it in the extensions only when they differ.
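
A rough sketch of how a reader could resolve the identifier under that
"primary unless overridden" inclination. Headers are modelled as plain
Python dicts, which is an assumption made to keep the example
self-contained; effective_dsid and the sample values are hypothetical, and
this inheritance is only my preference above, not an existing FITS rule.

```python
def effective_dsid(primary_header, extension_header, keyword="DS_IDENT"):
    """Return the identifier that applies to one extension.

    The extension's own keyword wins if present; otherwise the value in
    the primary header applies. Returns None if neither header has it.
    """
    if keyword in extension_header:
        return extension_header[keyword]
    return primary_header.get(keyword)


# Hypothetical headers: the second extension belongs to another dataset.
primary = {"DS_IDENT": "ADS/example#A"}
ext1 = {"EXTNAME": "EVENTS"}
ext2 = {"EXTNAME": "GTI", "DS_IDENT": "ADS/example#B"}

print(effective_dsid(primary, ext1))
print(effective_dsid(primary, ext2))
```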

and that there probably needs to be at least an option for the id to
be a vector value.

If by vector, you mean repeated keywords from the same or different ID
families, I agree. IDs are long strings. Won't fit many in 80 chars.

It would also be possible to impose a syntax limitation that each
identifier is limited to the space of a single kwd (68 characters
excluding the DS_IDENT= '...').
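
The 68 comes from the 80-character card: 8 columns for the keyword, 2 for
"= ", and 2 for the enclosing quotes. A small sketch of a check that
enforces it (dsid_card is a hypothetical helper, not an existing library
call, and the sample value is made up):

```python
CARD_LENGTH = 80
# 8 columns of keyword, 2 for "= ", 2 for the enclosing single quotes.
MAX_STRING = CARD_LENGTH - 8 - 2 - 2


def dsid_card(keyword, value):
    """Format a single 80-character card holding a string value.

    Raises ValueError if the value cannot fit in one card.
    """
    if len(keyword) > 8:
        raise ValueError("keyword longer than 8 characters: %r" % (keyword,))
    # A single quote inside a FITS string is written as two quotes,
    # so the limit applies to the escaped text.
    escaped = value.replace("'", "''")
    if len(escaped) > MAX_STRING:
        raise ValueError("value needs %d characters, limit is %d"
                         % (len(escaped), MAX_STRING))
    # Short strings are padded to at least 8 characters inside the quotes.
    card = "%-8s= '%-8s'" % (keyword, escaped)
    return card.ljust(CARD_LENGTH)


card = dsid_card("DS_IDENT", "ADS/example#12345")
print(repr(card))
print(len(card))
```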

If the given file (or HDU) "belongs" (or "refers" ? see below) to more
than one dataset at the same time and with equal rank, one could allow for
repeated DS_IDENT kwds (like COMMENT, HISTORY).

However one may need a sequence of DSIDnn if either :

- the file "belongs" or "refers" to different datasets with some
priority or ranking order

- one wants to keep track of a history : i.e. this file belongs to
the dataset I reduced (DSID01), I started my reduction from the
result of the pipeline provided by the xyz archive centre (DSID02),
which used the raw data of the given observation taken with the uvw
telescope/satellite (DSID03)
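
The history case could be sketched as follows. The helper name
dsid_sequence, the two-digit numbering and the three identifiers are all
assumptions made up for illustration; the ordering (most derived product
first, raw data last) simply follows the example above.

```python
def dsid_sequence(identifiers):
    """Pair an ordered list of identifiers with DSID01, DSID02, ... names."""
    if len(identifiers) > 99:
        raise ValueError("two-digit DSIDnn allows at most 99 entries")
    return [("DSID%02d" % index, ident)
            for index, ident in enumerate(identifiers, start=1)]


# Hypothetical chain: my reduction, the archive pipeline, the raw data.
chain = [
    "MYSITE/reduced/0001",
    "XYZ/pipeline/0001",
    "UVW/raw/0001",
]
for keyword, ident in dsid_sequence(chain):
    print(keyword, ident)
```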

Why should an ID have the time?

Astronomers have too often relied on convoluted filenames to convey the
placement of a specific data file within some multidimensional parameter
space. Time is key to groundbased observations because access to our

Also for satellites. Time is relevant because it's related to scheduling.
But that does not mean it has (or has not) to be part of the id. See
above. None of our business.

Why does it need a proposal ID, nation, agency?

Our need for a dataset identifier is precisely to implement the
proprietary policies of our current organization. I am very supportive

The identifier will just say "go to this site to (possibly) retrieve the
dataset". It's up to the site to then say "this dataset is not yet
public", to protect it with a password, or whatever.

From: Arnold Rots
Date: Wed, 24 Mar 2004 15:47:57 -0500 (EST)

The scope of Tom's proposal is really quite limited:

He is announcing the establishment of a convention that employs
a keyword (DS_IDENT) or set of keywords (DS_IDiii).
The intent is that the value of that keyword contains a label or key
that will allow users to obtain a pointer to a particular volume in
astronomical data space. No less, but also no more.

just a little bit more

Within the space of data identifier strings only the subspace of
strings starting with "ADS/" (case-insensitive!) is reserved.

I believe you should reserve also the fact that the first part of the id
is the namespace, and delegate all the rest to the namespace authority.
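
To make the delegation idea concrete, a small sketch. The names
is_reserved and route, and the table of authorities, are hypothetical;
treating every namespace (not only "ADS/") as case-insensitive is my own
assumption here, not something stated in the proposal.

```python
def is_reserved(identifier):
    """True if the identifier falls in the reserved "ADS/" subspace.

    The comparison ignores case, as stated for the "ADS/" prefix.
    """
    return identifier.lower().startswith("ads/")


def route(identifier, authorities):
    """Hand the part after the first slash to the namespace authority.

    authorities maps a lower-case namespace to a callable supplied by
    whoever manages that namespace.
    """
    namespace, _, remainder = identifier.partition("/")
    handler = authorities.get(namespace.lower())
    if handler is None:
        raise LookupError("unknown namespace: %r" % (namespace,))
    return handler(remainder)


# Hypothetical authority table; the handler just echoes what it was given.
authorities = {"ads": lambda rest: "ADS would resolve: " + rest}
print(is_reserved("ads/example#1"))
print(route("ADS/example#1", authorities))
```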

Maybe one should also add another kwd (DSAUTHOR) which points to a URL
of the namespace authority.

Or are we imagining something like the DNS with a set of "root
nameservers" ?

and purposes. For the Chandra Data Archive what you will get in
response to the key is a URL that will allow you to request a download
of data products associated with a particular observation - or maybe a
set of observations. If you try again next month, the files may be
different: we may have reprocessed or decided to add some products to
the package.

Hmmm ... I'm a bit worried by the fact that the dataset may change. Maybe
that's why it is not yet so clear to me what use a user will make of the
dataset identifier. Let's look at some examples.

a) I read a paper, which tells me "the data used here belong to dataset
xyz". I want to repeat the analysis of the SAME data myself, so I
use the id to retrieve the data. Obviously here I want to get the
SAME data, not a further and better version (do I ?).

No FITS file involved here though on the user end.

b) I retrieve the files, and I want to check they really belong to the
correct dataset.

c) I have somehow got some files, and I want to know what observation
they refer to, or to retrieve more files of the same dataset, or to
find what papers have been published using them.

d) I do my analysis and produce some more files. These are private, but
I may want to document that the starting point of the analysis was
the given dataset. But DS_IDENT is not the right way: my data DO NOT
belong to the dataset, so I need a separate history kwd ...

... if I ever distribute the data (I suppose I also have to quote
the DS_IDENT in any paper I write, for the ADS to use it)

Again, think of the dataset identifier as a key that allows the user
to obtain a pointer to the dataset. There is no need to encode any
information in it - nor is that prohibited

Agreed

The list of informational metadata that Rob provided looks to me more
like metadata that ought to reside in a database.

(or in other keywords in the same file if desired)
