Re: Crawling and indexing very slow

Ameya Aware Thu, 31 Jul 2014 12:04:41 -0700

Hi,

I have modified the code a little to add several metadata fields, as shown
below (FileConnector.java):


    data.addField("created", new Date(attr.creationTime().toMillis()));
    data.addField("last_accessed", new Date(attr.lastAccessTime().toMillis()));
    data.addField("last_modified", new Date(file.lastModified()));
    data.addField("size", file.length());


which are being passed to Solr.

Now, can I stop MCF from reading each file and sending its content, and just
pass the above information to Solr?
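
To illustrate the metadata-only idea, here is a minimal standalone sketch that collects the same four fields through java.nio without ever opening the file's contents. The class name FileMetadataSketch and the helper readMetadata are hypothetical and exist only for this example. How such a map would be handed to MCF in place of the content stream is not shown, since that depends on the connector code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, not part of ManifoldCF.
class FileMetadataSketch {

    // Collects file-system metadata only; the file's contents are never opened.
    static Map<String, Object> readMetadata(Path path) throws IOException {
        // A single call fetches all basic attributes from the file system.
        BasicFileAttributes attr = Files.readAttributes(path, BasicFileAttributes.class);

        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("created", new Date(attr.creationTime().toMillis()));
        fields.put("last_accessed", new Date(attr.lastAccessTime().toMillis()));
        fields.put("last_modified", new Date(attr.lastModifiedTime().toMillis()));
        fields.put("size", attr.size()); // size in bytes
        return fields;
    }

    public static void main(String[] args) throws IOException {
        // Demonstrate on a small temporary file.
        Path tmp = Files.createTempFile("mcf-metadata", ".txt");
        Files.write(tmp, "hello".getBytes());
        System.out.println(readMetadata(tmp));
        Files.delete(tmp);
    }
}
```

Reading attributes this way touches only the directory entry, so its cost should not grow with file size, which is the saving I am hoping for.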


Thanks,
Ameya


On Thu, Jul 31, 2014 at 2:57 PM, Karl Wright <[email protected]> wrote:

> Hi Ameya,
>
> The file system connector does not retrieve any metadata for a document at
> all.  So I'm not sure what metadata you are talking about.
>
> Karl
>
>
>
> On Thu, Jul 31, 2014 at 2:44 PM, Ameya Aware <[email protected]>
> wrote:
>
>> The thing is, I am not looking for the data or content of any of the
>> files. I am only interested in each file's metadata.
>>
>> So I thought it should be possible to skip reading the files and just
>> pass each file's metadata to Solr.
>>
>> This should save lots of time.
>>
>> Is it possible to do this?
>>
>> Thanks,
>> Ameya
>>
>>
>>
>> On Thu, Jul 31, 2014 at 2:13 PM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Ameya,
>>>
>>> (1) Please look at the Simple History report.  Note what kinds of
>>> documents are being fetched, what kinds are being indexed, and how long it
>>> is taking.  I have noted from your previous posts that you seem to be
>>> indexing a lot of very large EXE files.  This is useless and you should be
>>> excluding them.
>>>
>>> (2) Please look in the manifoldcf.log file for evidence that fetches
>>> and/or Solr indexing requests are being retried due to errors.  It doesn't
>>> take many documents being chronically retried before forward progress drops
>>> to near zero.
>>>
>>> (3) If you look into (1) & (2) and everything seems fine, it may be a
>>> misalignment between availability of several kinds of resources that is the
>>> problem.  Please get a thread dump of the agents process while it is
>>> crawling, using jstack.  Post that thread dump and we can tell you what to
>>> look at next.
>>>
>>> Karl
>>>
>>>
>>>
>>> On Thu, Jul 31, 2014 at 2:07 PM, Ameya Aware <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>> I am using the filesystem connector to index my entire C drive, with
>>>> Solr as the output connector.
>>>>
>>>> The initial 100000 documents were crawled and indexed successfully in a
>>>> couple of hours, but after that indexing slowed down badly (to around
>>>> 15-20 documents per minute).
>>>>
>>>>
>>>> I am not able to figure out whether the issue is with MCF or Solr.
>>>>
>>>>
>>>> Can you advise me on how to proceed with this?
>>>>
>>>>
>>>> Thanks,
>>>> Ameya
>>>>
>>>
>>>
>>
>
