[fitsbits] 'Dataset Identifications' postings (digest) - Page 3

#21

March 24th 04, 08:47 PM

Arnold Rots


The scope of Tom's proposal is really quite limited:

He is announcing the establishment of a convention that employs
a keyword (DS_IDENT) or set of keywords (DS_IDiii).
The intent is that the value of that keyword contains a label or key
that will allow users to obtain a pointer to a particular volume in
astronomical data space. No less, but also no more.

Within the space of data identifier strings only the subspace of
strings starting with "ADS/" (case-insensitive!) is reserved.

That's really all; you can stop reading here. But if the subject
fascinates you, you may read on.

Anybody who wants to participate in that subspace needs to know a little
more (like the substring between the first '/' and the first '#' that
represents the facility from which the datasets originated, and the
fact that that facility is free in choosing its definition of what a
dataset is and the encoding of everything after the first '#'), but
that is not particularly relevant for this newsgroup.
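As a rough illustration of the structure just described (reserved "ADS/" prefix matched case-insensitively, facility between the first '/' and the first '#', facility-defined remainder), here is a minimal Python sketch. The function name, the DatasetId type, and the sample identifier are hypothetical and are not part of the proposal.

```python
from typing import NamedTuple, Optional


class DatasetId(NamedTuple):
    namespace: str   # always "ADS" for the reserved subspace
    facility: str    # text between the first '/' and the first '#'
    local_id: str    # everything after the first '#', opaque to us


def parse_ads_identifier(value: str) -> Optional[DatasetId]:
    """Split an identifier of the form ADS/<facility>#<local part>.

    Returns None when the string is outside the reserved "ADS/" subspace
    or lacks the '#' separator. The local part is not interpreted, since
    its encoding is left to the issuing facility.
    """
    text = value.strip()
    # The reserved prefix is matched case-insensitively.
    if not text.upper().startswith("ADS/"):
        return None
    facility, sep, local_id = text[4:].partition("#")
    if not sep or not facility:
        return None
    return DatasetId("ADS", facility, local_id)


if __name__ == "__main__":
    # Hypothetical identifier, for illustration only.
    print(parse_ads_identifier("ads/Sa.XYZ#obs/12345"))
```

The sketch deliberately stops at the '#': anything after it is treated as an opaque key.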

"volume in data space" or "dataset" is left vague because it is up to
the issuing facility to decide what makes the most sense for its users
and purposes. For the Chandra Data Archive what you will get in
response to the key is a URL that will allow you to request a download
of data products associated with a particular observation - or maybe a
set of observations. If you try again next month, the files may be
different: we may have reprocessed or decided to add some products to
the package. For other archives it may be a specific file. For the
ADS itself, you may think of an OID as the label and a journal article
as the dataset.

There is no intent to prescribe the syntax or the semantics of the
identifiers. And there certainly is no intent to imply any kind of
inheritance or propagation of identifiers to the user level.

Again, think of the dataset identifier as a key that allows the user
to obtain a pointer to the dataset. There is no need to encode any
information in it - nor is that prohibited (the ADS could use bibcodes
as identifiers).

The list of informational metadata that Rob provided looks to me more
like metadata that ought to reside in a database. And if you wanted
to know what all the values are, you would take the NOAO dataset key
and query:

select * from RobsDatabase where DS_IDENT='NOAOidentifier';

where NOAOidentifier is something short and, possibly, random, rather
than trying to decode information from a string that stretches over
many header rows. Again in Chandra archive language, if you want to
browse that kind of information, you come to our web browser that will
query the database that contains the observation catalog; it will tell
you about objects, coordinates, times, observers, instruments,
proprietary times, public release dates, etc.
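The lookup idea above can be sketched with Python's standard sqlite3 module. This is only an illustration under stated assumptions: the table name, column names, and sample row are invented here and do not describe any real archive database.

```python
import sqlite3
from typing import Optional


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory catalog; schema and contents are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE observations ("
        "ds_ident TEXT PRIMARY KEY, telescope TEXT, "
        "instrument TEXT, release_date TEXT)"
    )
    conn.execute(
        "INSERT INTO observations VALUES (?, ?, ?, ?)",
        ("NOAO/k7Q2", "ExampleScope", "ExampleCam", "2005-09-01"),
    )
    return conn


def lookup_metadata(conn: sqlite3.Connection, ds_ident: str) -> Optional[dict]:
    """Return the metadata row for a short, opaque dataset key, or None."""
    conn.row_factory = sqlite3.Row
    cur = conn.execute(
        "SELECT * FROM observations WHERE ds_ident = ?", (ds_ident,)
    )
    row = cur.fetchone()
    return dict(row) if row is not None else None


if __name__ == "__main__":
    print(lookup_metadata(make_demo_db(), "NOAO/k7Q2"))
```

The key carries no information itself; everything descriptive comes back from the query.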

- Arnold

--------------------------------------------------------------------------
Arnold H. Rots Chandra X-ray Science Center
Smithsonian Astrophysical Observatory tel: +1 617 496 7701
60 Garden Street, MS 67 fax: +1 617 495 7356
Cambridge, MA 02138
USA http://hea-www.harvard.edu/~arots/
--------------------------------------------------------------------------

#23

March 25th 04, 08:02 AM

Clive Page


On Wed, 24 Mar 2004, Arnold Rots wrote:

> The scope of Tom's proposal is really quite limited:

I would like to support Tom's simple and basic proposal. As he says, a
unique key is of great value in building and searching databases, and
there are a whole lot of cases in which the DS_IDENT will be immediately
usable and useful. Of course we can all think of special cases in
which it will not be quite enough, but (in my opinion) this doesn't
invalidate the basic idea, which would be severely compromised by being
made more complicated. Let's keep it simple.

--
Clive Page
Dept of Physics & Astronomy,
University of Leicester,
Leicester, LE1 7RH, U.K.

#25

March 25th 04, 10:30 AM

Lucio Chiappetti


Let me answer a bunch of messages in one go.

(By the way, if you've not already guessed, the messages tagged "LC's
Nospam ..." are by me as well; it depends on whether I use fitsbits or post
to the NG; in general I use the mailing list for longer messages)

From: Rob Seaman
Date: Tue, 23 Mar 2004 21:54:40 +0000 (UTC)

>> You will find the current list at:
>> http://vo.ads.harvard.edu/dv/facilities.txt

> A very interesting list. ....

Indeed.

> There appears to be a confusion between a ground-based observing site
> and an observatory - perhaps this is a result of the list being compiled
> by our friends in the space-based astronomical community?

Maybe, that's why I found it quite natural for me ...
.... although I have some reservations about the "satellite" subset, namely:

is a Sa:spacecraft enough to define where the dataset is archived ?

Surely yes if there is a single archive managed by a single space
agency or its "contractor".

Possibly yes if the satellite is a cooperation between different
agencies, AND they have agreed to run the same pipeline AND to keep
mirror sites.

May fail if different agencies, organizations or institutes decide to
run different pipelines on the same data ! Resulting in two separate
datasets stemming from the same raw data.

> In general an observatory is a political entity, a telescope is a facility,
> and a site like Kitt Peak is a piece of real estate that may host
> multiple facilities from multiple observatories. Depending on the details
> of contracts or other binding operating agreements, an observatory may
> "own" the data that result from a particular facility like a telescope,

I guess it does not matter at all who owns the data rights for the period
during which the data are not public. If the data are to be indexed, it
means they are public ... either in some official archive or possibly in
some private one.

> A dataset ID can be a relatively simple beast - perhaps as simple as
> a data source ID and a serial number. But the full taxonomy of dataset
> provenance has to support many degrees of freedom. At the very least:
>
> Nation
> Funding agency

Just for the sake of argument, a "funding agency" is not necessarily
associated with a single nation, at least this side of the Atlantic (ESA,
ESO) ... or of the Panama canal (ESO again :-) ).

> Observatory
> Consortium member ("partner")

The latter is hardly relevant to the identification of the dataset.

> Telescope
> Instrument

These and the above are (loosely) covered by ORIGIN, TELESCOP, INSTRUME,
or other keywords which may be in the same FITS file, or (as said by
others already) in some database at the archive site.

> Date&Time
> Proposal ID
> PI and/or project ID

The latter two might be used inside the dataset identifier, or as pointers
to locate the data, internally by the archiving organization. But what is
"inside" is not our business. Similarly the date might be used in the
identifier, again none of our business.

I agree that usually an "observational" (i.e. not "multi-observation")
dataset may be linked to a single date, although the reverse is not
necessarily true. I mean I forgot one case in the examples in my previous
posting, i.e. the third below :

- ground based observatories typically observe one position of the sky
from one instrument at one telescope at a time

- space observatories often observe a position of the sky from SEVERAL
coaxial (although different FoV size) instruments/telescopes on the
satellite (and for me this is ONE dataset)

- however sometimes there are non-coaxial instruments. I take the case
of BeppoSAX, where during each OP (Observing Period) one had 2-3
different FOTs (datasets) : one for the NFIs (Narrow Field Instruments)
pointing along the Z axis, and one each for the two WFCs (Wide Field
Cameras) pointing along +Y and -Y (maybe just one was on). I guess
RXTE with the ASM has something similar.

-----------------------------------------------------------------------
From: Thomas McGlynn
Date: Wed, 24 Mar 2004 10:11:37 -0500

> [...] any specific syntax used. E.g., in FITS today we have keywords
> ORIGIN, TELESCOP, INSTRUME and OBSERVER where the general semantics of
> the keyword is specified, but the format is completely undefined

Unfortunately some aspects of the semantics are also ill-defined (see
discussions held at different times). Maybe it would be better to specify
the usage a bit more precisely.

Although most details (including some I've raised) are indeed out of
scope.

We should for instance state that the keyword is a string, and that the
first substring from the beginning to the first slash defines a namespace,
while the rest of the content is defined by the authority managing that
namespace.

We should also indicate the prospective usage, which is still not totally
clear to me (see below).

> So I see the discussion about where such a keyword would go,

I.e. in the primary header, in each extension header, or in some extension
headers

> whether we need a keyword that allows for multiple values
> (which DS_IDENT would not) as the kind of things we could

Do you mean multiple occurrences of the same keyword (like HISTORY or
COMMENT) or breaking a single long string value in continuation keywords ?

> to be at least an option for the id to be a vector value. The
> latter requirement mandates a shorter keyword (perhaps just DSID).

See below on "vector"

> However, I do not think that this is the appropriate forum
> for discussion of a particular syntax for the value of this keyword.

Except for the above notion of namespace, and for the possibility of
specifying that it should be a string contained in a SINGLE keyword (that
would limit its length to 68 characters).

From: Rob Seaman
Date: Wed, 24 Mar 2004 17:22:31 +0000 (UTC)

> It may well be that all astronomical semantic discussions should now
> happen under the happy VO umbrella. Personally, I think FITS has too
> often skirted the difficult issues. If we are to debate reserving
> DSIDnnnn for something called "dataset identifiers", isn't it
> appropriate to address what that means? If not, why do we care if
> an obscure set of keyword names are reserved at all?

That would avoid the loose situation we have for ORIGIN etc.

>> My own read on this part of the discussion is that most people would
>> want to see the ID repeated in all relevant HDU's

> Yes.

My personal inclination (as an extremist Ockhamist) is that keywords shall
not be multiplied praeter necessitatem. So I would tend to put one (set
of) keyword(s) in the primary header if they apply to the whole file, and to
put them in the extensions only when they differ.
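A reader following that suggestion would need a fallback rule: use the extension's own identifier if present, otherwise the one in the primary header. The sketch below illustrates this with plain dictionaries standing in for parsed headers. Note that the fallback itself is an assumption made for illustration; neither the proposal nor the FITS standard defines such inheritance, and the function name and sample values are hypothetical.

```python
from typing import Mapping, Optional, Sequence


def effective_ds_ident(
    headers: Sequence[Mapping[str, str]],
    hdu_index: int,
    keyword: str = "DS_IDENT",
) -> Optional[str]:
    """Return the identifier that applies to one HDU.

    headers[0] is taken to be the primary header. An extension's own
    keyword wins; otherwise the primary header's value is used.
    """
    own = headers[hdu_index].get(keyword)
    if own is not None:
        return own
    return headers[0].get(keyword)


if __name__ == "__main__":
    # Hypothetical file: the identifier is repeated only where it differs.
    headers = [
        {"DS_IDENT": "ADS/Sa.XYZ#obs/1"},   # primary header
        {},                                  # extension 1: inherits
        {"DS_IDENT": "ADS/Sa.XYZ#obs/2"},   # extension 2: overrides
    ]
    for i in range(len(headers)):
        print(i, effective_ds_ident(headers, i))
```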

>> and that there probably needs to be at least an option for the id to
>> be a vector value.

> If by vector, you mean repeated keywords from the same or different ID
> families, I agree. IDs are long strings. Won't fit many in 80 chars.

It would also be possible to impose a syntax limitation that each
identifier is limited to the space of a single kwd (68 characters
excluding the DSIDENT ='...').
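The 68-character figure follows from the fixed card layout: an 80-character card holds an 8-character keyword, the two-character "= " value indicator, and a string value enclosed in two single quotes, which leaves 80 - 8 - 2 - 2 = 68 characters. A minimal sketch of that check in Python, assuming no continuation convention is used; the function name and sample identifier are hypothetical.

```python
CARD_LEN = 80          # every FITS header card is 80 characters
MAX_STRING_LEN = 68    # 80 - 8 (keyword) - 2 ("= ") - 2 (quotes)


def make_string_card(keyword: str, value: str) -> str:
    """Format a single string-valued header card, or raise ValueError.

    Embedded single quotes are doubled, as FITS string values require,
    and the doubled form counts against the 68-character limit.
    """
    if len(keyword) > 8:
        raise ValueError("keyword longer than 8 characters")
    escaped = value.replace("'", "''")
    if len(escaped) > MAX_STRING_LEN:
        raise ValueError("value does not fit in a single card")
    # Short strings are padded to 8 characters inside the quotes,
    # following the usual fixed-format convention.
    card = f"{keyword:<8}= '{escaped:<8}'"
    return card.ljust(CARD_LEN)


if __name__ == "__main__":
    # Hypothetical identifier, for illustration only.
    print(repr(make_string_card("DS_IDENT", "ADS/Sa.XYZ#obs/12345")))
```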

If the given file (or HDU) "belongs" (or "refers" ? see below) to more
than one dataset at the same time and with equal rank, one could allow for
repeated DSIDENT kwds (like COMMENT, HISTORY).

However one may need a sequence of DSIDnn if either :

- the file "belongs" or "refers" to different datasets with some
priority or ranking order

- one wants to keep track of a history : i.e. this file belongs to
the dataset I reduced (DSID01), I started my reduction from the
result of the pipeline provided by the xyz archive centre (DSID02),
which used the raw data of the given observation taken with the uvw
telescope/satellite (DSID03)

>> Why should an ID have the time?

> Astronomers have too often relied on convoluted filenames to convey the
> placement of a specific data file within some multidimensional parameter
> space. Time is key to groundbased observations because access to our

Also for satellites. Time is relevant because it's related to scheduling.
But that does not mean it has (or has not) to be part of the id. See
above. None of our business.

>> Why does it need a proposal ID, nation, agency?

> Our need for a dataset identifier is precisely to implement the
> proprietary policies of our current organization. I am very supportive

The identifier will just say "go to this site to eventually retrieve the
dataset". It's up to the site to then say "this dataset is not yet
public", to protect it with a password, or whatever.

From: Arnold Rots
Date: Wed, 24 Mar 2004 15:47:57 -0500 (EST)

> The scope of Tom's proposal is really quite limited:
>
> He is announcing the establishment of a convention that employs
> a keyword (DS_IDENT) or set of keywords (DS_IDiii).
> The intent is that the value of that keyword contains a label or key
> that will allow users to obtain a pointer to a particular volume in
> astronomical data space. No less, but also no more.

just a little bit more

> Within the space of data identifier strings only the subspace of
> strings starting with "ADS/" (case-insensitive!) is reserved.

I believe you should also reserve the fact that the first part of the id
is the namespace, and delegate all the rest to the namespace authority.

Maybe one should also add another kwd (DSAUTHOR) which points to a URL
of the namespace authority.

Or are we imagining something like the DNS with a set of "root
nameservers" ?

> and purposes. For the Chandra Data Archive what you will get in
> response to the key is a URL that will allow you to request a download
> of data products associated with a particular observation - or maybe a
> set of observations. If you try again next month, the files may be
> different: we may have reprocessed or decided to add some products to
> the package.

Hmm ... I'm a bit worried by the fact that the dataset may change. Maybe
that's why it is not yet so clear to me what use a user will make of the
dataset identifier. Let me give some examples.

a) I read a paper, which tells me "the data used here belong to dataset
xyz". I want to repeat the analysis of the SAME data myself, so I
use the id to retrieve the data. Obviously here I want to get the
SAME data, not a further and better version (do I ?).

No FITS file involved here though on the user end.

b) I retrieve the files, and I want to check they really belong to the
correct dataset.

c) I have somehow obtained some files, and I want to know which observation
they refer to, or to retrieve more files of the same dataset, or to
find what papers have been published using them.

d) I do my analysis and produce some more files. These are private, but
I may want to document that the starting point of the analysis was
the given dataset. But DS_IDENT is not the right way, my data DO NOT
belong to the dataset, I need a separate history kwd ...

... if I'd ever distribute the data (I suppose I also have to quote
the DS_IDENT in any paper I will write, for the ADS to use it)

> Again, think of the dataset identifier as a key that allows the user
> to obtain a pointer to the dataset. There is no need to encode any
> information in it - nor is that prohibited

Agreed

> The list of informational metadata that Rob provided looks to me more
> like metadata that ought to reside in a database.

(or in other keywords in the same file if desired)

--
----------------------------------------------------------------------
is a newsreading account used by more persons to
avoid unwanted spam. Any mail returning to this address will be rejected.
Users can disclose their e-mail address in the article if they wish so.

#27

March 25th 04, 03:45 PM

Arnold Rots


Lucio Chiappetti wrote:
> Let me answer a bunch of messages in one go.

...

> From: Arnold Rots
> Date: Wed, 24 Mar 2004 15:47:57 -0500 (EST)

>> The scope of Tom's proposal is really quite limited:
>>
>> He is announcing the establishment of a convention that employs
>> a keyword (DS_IDENT) or set of keywords (DS_IDiii).
>> The intent is that the value of that keyword contains a label or key
>> that will allow users to obtain a pointer to a particular volume in
>> astronomical data space. No less, but also no more.

> just a little bit more

OK, just the following sentence.

>> Within the space of data identifier strings only the subspace of
>> strings starting with "ADS/" (case-insensitive!) is reserved.

> I believe you should also reserve the fact that the first part of the id
> is the namespace, and delegate all the rest to the namespace authority.
>
> Maybe one should also add another kwd (DSAUTHOR) which points to a URL
> of the namespace authority.
>
> Or are we imagining something like the DNS with a set of "root
> nameservers" ?

Nothing is implied or recommended by this proposal.
We took great pains to ensure that the ADS/ identifiers conform
to the standard being developed by the IVOA, but that is not part of
this proposal. Others may want to suggest further conventions tying
the two together in the future, but this is not the time to do that -
for one thing, the IVOA standard has not yet been completed.

>> and purposes. For the Chandra Data Archive what you will get in
>> response to the key is a URL that will allow you to request a download
>> of data products associated with a particular observation - or maybe a
>> set of observations. If you try again next month, the files may be
>> different: we may have reprocessed or decided to add some products to
>> the package.

> Hmm ... I'm a bit worried by the fact that the dataset may change. Maybe
> that's why it is not yet so clear to me what use a user will make of the
> dataset identifier. Let me give some examples.
>
> a) I read a paper, which tells me "the data used here belong to dataset
> xyz". I want to repeat the analysis of the SAME data myself, so I
> use the id to retrieve the data. Obviously here I want to get the
> SAME data, not a further and better version (do I ?).
>
> No FITS file involved here though on the user end.

Do you just want to repeat the analysis or do you want to do a better
job? We would give you the current (best) set of data products based
on the same raw observational data, so you can do your (better) job.
If you can't reconcile results, you can ask us for the version that
was (most likely) used for the paper and we'll be happy to give it to
you, provided it was a "good" version.

> b) I retrieve the files, and I want to check they really belong to the
> correct dataset.
>
> c) I have somehow obtained some files, and I want to know which observation
> they refer to, or to retrieve more files of the same dataset, or to
> find what papers have been published using them.

This all goes by OBSID, not DS_IDENT, at least for us, although we
could make it work through the identifier as well.

> d) I do my analysis and produce some more files. These are private, but
> I may want to document that the starting point of the analysis was
> the given dataset. But DS_IDENT is not the right way, my data DO NOT
> belong to the dataset, I need a separate history kwd ...

Agreed.

> ... if I'd ever distribute the data (I suppose I also have to quote
> the DS_IDENT in any paper I will write, for the ADS to use it)

That's the idea - or multiple identifiers.

>> Again, think of the dataset identifier as a key that allows the user
>> to obtain a pointer to the dataset. There is no need to encode any
>> information in it - nor is that prohibited

> Agreed

>> The list of informational metadata that Rob provided looks to me more
>> like metadata that ought to reside in a database.

> (or in other keywords in the same file if desired)


