Re: A slight regression in 2.9.1 text:p loading...

From: Ben Martin <monkeyiq_at_users.sourceforge.net>
Date: Mon Jul 18 2011 - 10:31:37 CEST

Hi Martin,

  Unsurprisingly, this is actually a bit more complex than I originally
thought :) I was reading over the part of the ODF spec I cited, and
traces of charData() calls, and the bug that ran text together when
whitespace is eaten 12813.

A fairly simple fragment created with OO is the following, which should
all be on one line.

<text:p text:style-name="Standard">
Hello <text:span text:style-name="T1">there</text:span> you
</text:p>

In this case one believes that the spaces are significant. Abiword's
charData might be fed these during a load:
"Hello "
"there"
" you"

The spec goes:
2) The character data of the paragraph element and of all descendant
elements for which the OpenDocument schema permits the inclusion of
character data for the element itself and all its ancestor elements up
to the paragraph element, is concatenated in document order.

3) Leading " " (U+0020, SPACE) characters at the start of the resulting
text and trailing SPACE characters at the end of the resulting text are
removed.

So it would seem that accruing m_charData in the charData() method and
treating the removal of leading and trailing whitespace in flush() might
be more appropriate. The charData() method should probably fold anything
in
Z = { U+0009, U+000D, U+000A )
to the SPACE = U+0020 char as per the spec, and maybe collapse
contiguous sequences of 2+ spaces to 1 space too.

The folding of "other" whitespace into spaces and the collapsing must be
done in charData() needed so that "text:line-break" and "text:s" code
can still simply work on m_charData. If folding/collapsing was delayed
to flush() then those particular cases would have their effect ignored.

This does leave the sticky case of a text:s at the start of a text:p.
The spec makes it hard to tell if one or more text:s at the beginning of
a text:p should actually stick around or not. It seems not, looking at
the following quote from the spec 6.1.3:
    The <text:s> element is used to represent
    the [UNICODE] character " " (U+0020, SPACE).

If this was a special case for leading whitepsace in a text:p then I
would imagine that 6.1.2 would explicitly mention it.

Of course, once I'm somewhat happy I'll check 11847 against 2.9.1 and my
modified trunk.

On Mon, 2011-07-18 at 11:04 +1000, Martin Sevior wrote:
> Hi Ben,
>
> Yes, I did write that code! it was revision 29316 with the commit comment:
>
> "Properly fix bug 11847 Spaces inserted before paragraphs."
>
> So when you do your fixes, check that bug 11847 remains fixed :-)
>
> Cheers
>
> Martin
>
> On Mon, Jul 18, 2011 at 10:50 AM, Martin Sevior <msevior@gmail.com> wrote:
> > Hi Ben,
> >
> > This seems vaguely familiar to me but revision 29384 appears t be a
> > commit by uwog that cleans up a bit of code. As far as I can tell
> > if(!m_bContentWritten) {...} was present before 29384.
> >
> > I think I might have put that code in place but I can't recall the
> > reason. I'll do a bit of digging with viewVC
> >
> > Cheers
> >
> > Martin
> >
> >
> > On Sun, Jul 17, 2011 at 4:58 PM, Ben Martin
> > <monkeyiq@users.sourceforge.net> wrote:
> >> Hi,
> >> I dug into the loading of hand crafted ODT files in 2.9.1 and have
> >> filed a bug 13112 [1]. I've been looking into fixing the bug, but also
> >> don't want to step on anyone's toes in doing so.
> >>
> >> The gist of what I see being the problem is a special case in void
> >> ODi_TextContent_ListenerState::charData() where
> >> if(!m_bContentWritten) {...} is called. I'm not sure what use case
> >> prompted this special case to be added, so I don't want to blindly
> >> change it without discussion first.
> >>
> >> It seems from "6.1.2 White Space Characters" of the spec [2], in
> >> particular page 120 of [3], that leading and trailing whitespace is to
> >> be removed, and internal whitespace is to be collapsed. ie, a stream of
> >> 2+ SPACE to be replaced with a single SPACE.
> >>
> >> Prior to collapsing anything in
> >> Z = { U+0009, U+000D, U+000A ) is first replaced with
> >> SPACE = U+0020.
> >>
> >> The current code, when given something like the following will preserve
> >> the newline after magic;
> >>
> >> <text:p>
> >> I am a magic
> >> and special wizard<text:span>....
> >> </text:p>
> >>
> >> This is because the entire string "\n I am a magic\n and special
> >> wizard" will be passed to charData() on the first call when
> >> m_bContentWritten == false;
> >>
> >> FWIW the special case was brought in at revision 29384.
> >>
> >>
> >> [1] http://bugzilla.abisource.com/show_bug.cgi?id=13112
> >> [2] http://docs.oasis-open.org/office/v1.2/cos01/
> >> [3]
> >> http://docs.oasis-open.org/office/v1.2/cos01/OpenDocument-v1.2-cos01-part1.pdf
> >>
> >>
> >
>

Received on Mon Jul 18 10:32:01 2011

This archive was generated by hypermail 2.1.8 : Mon Jul 18 2011 - 10:32:01 CEST