Hmm, you know, I don't even know what
a "row" means when importing XML. But
let's talk about importing XML. As far as
I know, unless you use XSLT to perform
a transformation, Solr doesn't import XML
except as well-formed Solr documents,
something of the form:
<add>
  <doc>
    <field name="blah">value</field>
  </doc>
</add>
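
For instance (just a sketch — the URL, core path, and file name here are placeholders for whatever your setup actually uses), a file in that format can be posted to the standard update handler with curl:

curl 'http://localhost:8983/solr/update?commit=true' \
  -H 'Content-Type: text/xml' \
  --data-binary @docs.xml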

If you're importing anything else, I don't think
Solr understands it at all... So what does
your "funky XML document" look like?
What, if any, errors are reported in your Solr
logs?

Also, it's surprisingly easy to debug Solr
while it runs. In IntelliJ, all it takes is
creating a "Remote" run configuration, and
IntelliJ shows you the JVM parameters you
need to pass when you start your Solr. From there
you just start your Solr instance with those
parameters and attach the debugger remotely. I took the
entire source tree for the Solr version I was using,
compiled it (ant example), and it was
easy. So you might get more mileage
out of debugging Solr rather than logging, but
that's a guess.
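
If it helps, the parameters IntelliJ hands you for a remote configuration look roughly like this (the port, 5005, is just the default it suggests, and in the 4.x example directory Solr is started with "java -jar start.jar" — adjust for your setup):

java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=5005 -jar start.jar

Then point the IntelliJ Remote configuration at localhost:5005 and attach.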

Best
Erick

On Sat, Oct 1, 2011 at 6:17 PM, Pulkit Singhal <pulkitsing...@gmail.com> wrote:
> ====
> The Problem:
> ====
> When using DIH with trunk 4.x, I am seeing some very funny numbers
> with a particularly large XML file that I'm trying to import. Usually
> there are bound to be more rows than documents indexed in DIH because
> of the forEach property, but my other XML files have maybe 1.5 times
> the rows compared to the # of docs indexed.
>
> This particular funky file ends up with something like:
> <str name="Total Rows Fetched">25614008</str>
> <str name="Total Documents Processed">1048</str>
> That's 25 million rows fetched before even a measly 1000 docs are indexed!
> Something has to be wrong here.
> I checked the XML for well-formedness in vim by running ":!xmllint
> --noout %" so I think there are no issues there.
>
> ====
> The Question:
> ====
> For those intimately familiar with DIH code/behaviour: What is the
> appropriate log-level that will let me see the rows & docs printed out
> to log as each one is fetched/created? I don't want to make the logs
> explode because then I won't be able to read through them. Is there
> some gentle balance here that I can leverage?
>
> Thanks!
> - Pulkit
>
